Linear predictive coding is a popular technique for
speech  analysis  and  synthesis.  But,  continuous  speech 
generated  from  a linear  predictive  (LP)  synthesizer  still 
lacks  naturalness.  The  objective  of  this  study  is  to 
examine  the  various  issues  involved  in  the  production  and 
evaluation  of  natural  sounding  synthetic  speech,  using  the 
linear  predictive  model.  In  particular,  a scheme  for 
obtaining  the  control  parameters  of  the  LP  synthesizer  very 
reliably  using  the  electroglottograph  (EGG)  signal  as  a 
glottal  sensor  is  proposed. 

Our  database  consists  of  speech  and  EGG  signals  for 
sentences  produced  by  male,  female  and  child  speakers.  The 
two  signals  are  obtained  time  synchronously.  The  features 
of  the  EGG  signal  and  their  relation  to  the  glottal 
vibratory cycle are briefly discussed. Algorithms for
computing the fundamental frequency contour and
voiced/unvoiced  decision  from  the  EGG  signal  are  outlined. 
Pitch  synchronous  LP  analysis  and  synthesis  schemes,  guided 
by  the  EGG  signal,  are  discussed.  Excitation  signals  for 
synthesizing  voiced  speech  are  also  derived  from  the  EGG 
signal.  The  synthesized  sentences  are  evaluated  by  a total 
of  twenty  listeners  in  a formal  listening  test. 

The  results  show  that  synthesis  naturalness  can  be 
significantly  improved  with  the  use  of  the  EGG  signal. 
Errors  in  voicing/unvoicing  and  pitch  computation  are 
eliminated. Pitch synchronous analysis-synthesis performed
over  one  whole  period  and  non-impulse  excitations  derived 
from  the  EGG  signal  result  in  large  improvements  to 
synthesis  naturalness. 
CHAPTER 1
INTRODUCTION

1.1 Objective

With  the  increasing  use  of  speech  for  mind-machine 
interaction,  the  generation  of  natural  sounding  synthetic 
speech  is  becoming  a requirement  for  successful  products  in 
the  market  place.  Computer  generated  speech  should  meet  our 
"standards"  of  intelligibility,  quality,  recognizability  and 
naturalness.  The  increasing  demands  placed  on  the  finer 
attributes  of  synthetic  speech  have  led  to  considerable 
research  to  understand  the  factors  involved  in  the 
reproduction  of  "natural  sounding"  synthetic  speech. 
Recently,  several  attempts  have  been  made  to  obtain  an 
objective  evaluation  of  synthetic  speech.  Although  there 
has  been  a proliferation  of  definitions,  techniques  and 
observations,  a consensus  on  the  automatic  production  and 
evaluation  of  natural  sounding  speech  is  yet  to  be 
reached. From this standpoint, the objective of this study
is two-fold: 1) to establish factors contributing to the
perceptual attributes of computer generated speech in the
context of a given model of speech production and 2) to
examine the methodology of subjective and objective
evaluation of such synthetic speech.

1.2 The Speech Code

Speech  is  the  primary  mode  of  human  communication.  The 
ability  to  code  and  transmit  information  vocally  seems  to  be 
unique  to  the  human  species.  Some  have  suggested  that 
humans  be  labeled  "Homo  Loquens"  — Man,  the  Talking  Animal 
(Fry,  1977),  in  recognition  of  this  remarkable  achievement. 

Speech  communication  consists  of  an  hierarchical  set  of 
processes  that  link  the  speaker  with  the  listener.  Abstract 
thought  is  first  converted  into  a language  form  understood 
by  the  listener,  which  is  then  translated  into  a set  of 
neuro-muscular  commands  that  activate  the  various  organs  of 
the  vocal  apparatus.  These  include  the  lungs,  the  vocal 
folds,  the  "articulators"  of  the  vocal  tract,  viz.,  tongue, 
teeth,  lips,  etc.  and  the  nasal  tract.  A stream  of  air 
passes  through  either  the  vibrating  vocal  folds  or  a 
constriction  along  the  vocal  tract  and  excites  the  resonant 
cavity  formed  by  the  mouth  and  the  resonators.  The 
resulting  pressure  variations  constitute  the  acoustic  speech 
signal.  This  signal  activates  the  hearing  apparatus  of  the 
listener,  converting  the  physical  signal  into  neural 
messages  that  are  translated  into  a language  form  and 
finally  into  abstract  thought  in  the  listener's  brain.  The 
speech  signal  conveys  to  the  listener  not  only  the  semantic 
content  of  the  message  but  also  information  about  the 
speaker  and  the  speaking  conditions. 

The impact of the technological progress of the past
few  centuries  on  speech  communication  has  been  to  extend  the 
range  and  reliability  of  transmission  of  the  speech 
signal.  Today  speech  is  being  employed  as  an  interface 
between  the  computer  and  its  human  users.  Generation  of  the 
speech  signal  by  the  computer  has  led  to  Automatic  Voice 
Response  systems  and  speech  aids  for  the  handicapped. 

We  must  ensure  that  this  "speech  interface"  be 
efficient  and  perceptually  acceptable.  This  means  that  the 
synthetic  speech  should  sound  as  much  like  human  speech  as 
possible.  Studies  conducted  on  the  acceptability  of 
synthetic speech (Pisoni, 1980; Pisoni et al., 1983)
indicate  that  synthetic  speech  places  greater  demands  on  the 
perceptual  mechanism  than  natural  speech.  For  example,  the 
response  time  of  subjects  to  synthetic  speech  stimuli  was 
significantly  longer  than  that  for  natural  speech  stimuli. 

Research  in  the  understanding  of  speech  naturalness  has 
taken  the  approach  of  "Analysis  by  Synthesis."  The 
parameters  of  a speech  production  model  are  manipulated  in 
different  ways  in  order  to  determine  the  smallest  set  of 
parameters  that  preserve  the  naturalness  of  the  synthesized 
speech.  This  approach  provides  information  on  the  features 
of  the  speech  signal  that  should  be  accurately  reproduced  to 
preserve  naturalness,  while  at  the  same  time,  it  indicates 
whether  certain  essential  features  of  the  speech  signal  have 
survived the modeling process. The most prevalent approach
to  evaluation  of  naturalness  has  been  subjective,  but  there 
is  an  increasing  interest  in  the  development  of  objective, 
quantitative  evaluation  techniques. 

In  this  study  we  have  attempted  to  analyze  the 
contributions  of  the  various  parameters  of  a particular 
speech  synthesis  scheme,  viz.,  the  Linear  Prediction  scheme, 
to  the  naturalness  of  synthetic  speech.  We  have  explored 
the  methodology  of  evaluating  naturalness  both  subjectively 
and  objectively. 

A study  of  speech  naturalness  can  be  broken  down  into 
two  steps. 

i)  Development  of  a parametric  model  for  speech 
production. 

ii)  Evaluation  of  speech  produced  by  such  a model, 
using  subjective  listening  tests  and  objective 
distance  measures  based  on  the  characteristics  of 
the  speech  signal. 

A typical production-evaluation scheme is shown in Figure
1.1. A natural speech signal is processed to derive a set
of analysis parameters, which in turn control the
synthesizer that produces a synthetic version of the
original signal. Each synthetic stimulus is compared
subjectively with its original to evaluate a specific
attribute, such as intelligibility or naturalness. The
synthesis is repeated for various analysis conditions, with
the resulting synthesis being evaluated each time, until an
acceptable performance is achieved. Subjective evaluation
is time consuming, expensive and listener dependent.
Hence, there is an increasing interest in obtaining
quantitative distance measures derived directly from the
speech signal for an objective comparison of original and
synthetic speech stimuli. The latter approach faces a
number of difficulties, which we will discuss in this
study. In the next section, the different methods of speech
production are outlined, followed by a brief description of
the speech evaluation schemes. These sections provide a
rationale for the research problem proposed subsequently.

Figure 1.1. Speech production-evaluation scheme.

1.3 Models of Speech Production

The  models  of  the  acoustic  production  of  the  speech 
signal  have  centered  upon  the  "Source-Filter"  concept  (Fant, 
1960).  The  speech  signal  is  the  output  of  a vocal  tract 
filter  which  is  excited  by  a vocal  source  function.  In 
speech  analysis,  we  are  interested  in  techniques  for 
deriving  the  parameters  of  the  speech  production  model  from 
the  natural  speech  signal.  In  synthesis,  our  objective  is 
to  generate  the  acoustic  speech  signal  by  controlling  the 
various  parameters  of  the  model. 

1.3.1 Speech Analysis

The speech analysis scheme, as shown in Figure 1.2,
derives from the speech signal two sets of parameters:
a)  vocal  tract  parameters  representing  the  spectrum  of  the 
vocal  tract  and  b)  excitation  parameters,  which  include 
periodicity,  voiced/unvoiced  decision  and  amplitude.  The 
spectral  parameters  can  be  either  formants  of  the  vocal 
tract  spectrum  or  the  linear  prediction  coefficients.  The 
speech  signal  is  low  pass  filtered,  sampled  and  digitized. 
Pre-processing  may  involve  preemphasis  before  linear 
prediction  analysis  or  further  low  pass  filtering  preceding 
pitch  estimation.  The  analysis  procedure  will  be  discussed 
in  detail  in  Chapters  3 and  4. 
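
As an illustration of the preemphasis step, a first-order difference filter is a common choice. The sketch below is only an example of that idea and is not taken from the analysis scheme of Figure 1.2; the function name `preemphasize` and the coefficient value 0.95 are assumptions.

```python
import numpy as np

def preemphasize(signal, coeff=0.95):
    """First-order preemphasis: y[n] = x[n] - coeff * x[n-1].

    coeff=0.95 is an assumed, commonly quoted value; the text does not
    specify one.
    """
    x = np.asarray(signal, dtype=float)
    if x.size == 0:
        return x.copy()
    y = np.empty_like(x)
    y[0] = x[0]                       # no previous sample for the first point
    y[1:] = x[1:] - coeff * x[:-1]    # boost high frequencies relative to low
    return y
```

The filter attenuates slowly varying components, so a constant input decays to a small residual after the first sample.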

1.3.2  Speech  Synthesis 

The  aim  of  speech  synthesis  is  to  reproduce  the 
acoustic  speech  signal  by  controlling  the  various  parameters 
of  the  speech  production  model. 

The history of speech synthesis goes back to the
eighteenth century with the acoustic resonators of
Kratzenstein and Von Kempelen's speaking machine (Flanagan,
1972). In recent times, synthesizers have been electrical
in nature. The vocoder of H. Dudley was one of the most
notable electrical synthesizers in the early part of this
century (Schroeder, 1966). Progress in electronics and
digital computers brought forth several types of vocoders
which used control parameters derived from an automatic
analysis of the speech signal (Flanagan, 1972). Speech
synthesizers driven by phonetic rules or hand generated
control parameters have also been developed (Holmes et al.,
1964; Rabiner, 1968). Other synthesizers have attempted to
reproduce speech from a model of the articulatory mechanism
rather than its temporal or spectral structure. The
synthesizer of Flanagan (Flanagan et al., 1975), which is
based on a dynamic model of the vocal folds and a vocal
tract analog, and Coker's articulatory synthesizer
(Coker, 1968) are prime examples of this class.

Figure 1.2. Speech analysis scheme.

Relevant to this study, there are basically three types
of  synthesis  schemes,  each  differing  in  its  specific  control 
parameters,  but  all  aimed  at  producing  natural  sounding 
speech.  These  synthesizers  are  discussed  below. 

1.3.2.1 Formant Synthesizer

This synthesis technique is based on the source-filter
model  of  speech  production.  The  frequency  response  of  the 
filter  is  characterized  by  peaks  referred  to  as  "Formants" 
and  by  valleys,  which  are  controlled  by  the  use  of  resonant 
and  anti-resonant  circuits,  whose  center  frequency  and 
bandwidth  can  be  individually  specified.  The  resonant 
circuits  are  connected  either  in  series  or  in  parallel, 
facilitating the production of both nasal and non-nasal
sounds.  An  excitation  source  resembling  the  natural 
excitation  is  used  with  its  amplitude  and  frequency 
specified  individually.  A cascade  and  parallel  formant 
synthesizer  (Klatt,  1980)  is  illustrated  in  Figure  1.3. 
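
A single resonant circuit of this kind can be sketched digitally as a second-order recursion whose coefficients follow from the center frequency and bandwidth. The form below is the digital resonator commonly associated with Klatt (1980); the function names, the 10 kHz sampling rate and the example formant values are illustrative assumptions, not parameters taken from this study.

```python
import numpy as np

def resonator_coefficients(center_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    T = 1.0 / sample_rate_hz
    c = -np.exp(-2.0 * np.pi * bandwidth_hz * T)
    b = 2.0 * np.exp(-np.pi * bandwidth_hz * T) * np.cos(2.0 * np.pi * center_hz * T)
    a = 1.0 - b - c                   # normalizes the gain at 0 Hz to unity
    return a, b, c

def resonate(x, center_hz, bandwidth_hz, sample_rate_hz):
    """Pass the sequence x through one formant resonator."""
    a, b, c = resonator_coefficients(center_hz, bandwidth_hz, sample_rate_hz)
    y = np.zeros(len(x))
    y1 = y2 = 0.0                     # previous two output samples
    for n, xn in enumerate(x):
        yn = a * xn + b * y1 + c * y2
        y[n] = yn
        y2, y1 = y1, yn
    return y

# Cascade of three resonators driven by a 100 Hz impulse train
# (hypothetical formant frequencies and bandwidths, in Hz).
fs = 10_000.0
source = np.zeros(1000)
source[::100] = 1.0
vowel = source
for f, bw in [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]:
    vowel = resonate(vowel, f, bw, fs)
```

Connecting the resonators in series, as in the loop above, corresponds to the cascade configuration; a parallel configuration would instead sum the individually scaled resonator outputs.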

The  ability  to  individually  vary  the  control  parameters 
makes  this  synthesizer  very  useful  for  phonetic  and 
perceptual  research.  The  behavior  of  formants  for  various 
speech  sounds  has  been  extensively  studied.  The  formants 
can be estimated via spectrograph displays or numerical
techniques  based  on  linear  prediction  analysis  discussed  in 
Chapter  3 (McCandless,  1974). 

The  synthesis  schemes  of  Klatt  (1980)  and  Holmes  (1979) 
have produced speech of high quality. However, precise
estimation of formants is difficult and it takes
considerable skill to obtain and manipulate the control
parameters.

1.3.2.2 Linear Prediction Synthesizer

The motivation for the linear prediction synthesizer is
rooted in a basic model of speech production, viz., that
speech is the output of a linear time varying system excited
by either periodic or random noise excitation (Atal and
Hanauer, 1971; Markel and Gray, 1976). The linear system is
modeled as an all-pole (autoregressive) filter. This means
that each speech sample can be approximated by a linear
combination of some past samples. By minimizing the sum of
the squared differences between the actual and the linearly
predicted samples over a suitably chosen interval, a unique
set of predictor coefficients, which identify the linear
system, can be obtained. In addition to the predictor
coefficients, which can be related to the vocal tract
spectrum, other speech parameters such as fundamental
frequency, energy and the voiced/unvoiced decision can also
be determined from the linear prediction analysis. Speech
synthesis is accomplished by driving the all-pole filter with
a suitably chosen excitation function, as shown in
Figure 1.4.
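
The prediction, minimization and synthesis steps described above can be written compactly. The following is a sketch in a common textbook notation; the symbols $p$ (predictor order), $a_k$ (predictor coefficients), $e(n)$ (prediction error), $E$ (total squared error), $u(n)$ (excitation) and $G$ (gain) are assumptions introduced here for illustration.

```latex
\begin{align}
\hat{s}(n) &= \sum_{k=1}^{p} a_k \, s(n-k), \\
e(n) &= s(n) - \hat{s}(n), \qquad E = \sum_{n} e^{2}(n), \\
s(n) &= \sum_{k=1}^{p} a_k \, s(n-k) + G \, u(n), \qquad
H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} .
\end{align}
```

Setting the derivative of $E$ with respect to each $a_k$ to zero gives $p$ linear equations in the predictor coefficients. The last line is the synthesis model: the all-pole filter $H(z)$ driven by the excitation $u(n)$.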

The  human  hearing  mechanism  is  believed  to  be  more 
sensitive  to  the  short  term  magnitude  spectrum  of  a speech 
signal  than  to  its  phase  spectrum.  The  ability  of  the 
linear  prediction  scheme  to  efficiently  represent  the 
envelope  of  the  short  term  magnitude  spectrum  is  the  main 
reason  for  its  success.  The  relative  ease  of  analysis  and 
the  flexibility  of  parameter  manipulation  have  made  this 
scheme  one  of  the  most  popular  synthesis  schemes.  Further, 
since  speech  segments  can  be  represented  by  a small  set  of 
predictor  coefficients  and  excitation  parameters,  this 
method  is  eminently  suited  for  data  compression,  low  bit 
rate  transmission  and  efficient  storage. 
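
A minimal sketch of one analysis-synthesis pass may make the scheme concrete. It uses the autocorrelation formulation with the Levinson-Durbin recursion, which is one standard way of minimizing the squared prediction error and is not necessarily the variant adopted later in this study. The function names, the predictor order, the sampling rate, the pitch value and the synthetic test frame are all assumptions.

```python
import numpy as np

def autocorrelation(frame, max_lag):
    """Short-term autocorrelation R(0..max_lag) of one analysis frame."""
    x = np.asarray(frame, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(max_lag + 1)])

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns (a, err): a[k-1] multiplies s(n-k) in the predictor, and err is
    the minimum prediction error energy.
    """
    a = np.zeros(order)
    err = float(r[0])
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        prev = a[:i].copy()
        a[:i] = prev - k * prev[::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def synthesize(a, excitation, gain=1.0):
    """Drive the all-pole filter 1 / (1 - sum_k a_k z^-k) with an excitation."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, min(len(a), n) + 1):
            acc += a[k - 1] * out[n - k]
        out[n] = acc
    return out

# Example with assumed values: 10 kHz sampling, 100 Hz pitch, order 10.
# The "speech" frame is a synthetic stand-in, not recorded data.
fs, f0, order = 10_000, 100, 10
rng = np.random.default_rng(0)
frame = synthesize(np.array([1.3, -0.6]), rng.standard_normal(400)) * np.hamming(400)
a, err = levinson_durbin(autocorrelation(frame, order), order)
excitation = np.zeros(400)
excitation[:: fs // f0] = 1.0            # one impulse per pitch period
voiced = synthesize(a, excitation, gain=np.sqrt(err))
```

Replacing the impulse train with random noise gives the unvoiced case, which is the simple two-state excitation this study seeks to improve upon.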


Figure 1.4. A linear prediction synthesizer.

1.3.2.3 Articulatory Synthesizer

This  scheme  is  based  directly  on  the  structure  and 
movements  of  the  articulatory  mechanism,  which  includes  the 
tongue,  lips,  mandible,  etc.  In  one  approach,  due  to  Coker 
(1968),  each  speech  sound  is  characterized  by  a target 
configuration  of  the  articulators  and  the  dynamics  of  the 
vocal  tract  movement  are  simulated  by  the  individual  control 
of  the  articulator  positions.  The  resulting  vocal  tract 
area functions are transformed into formant frequency data
which control a formant synthesizer to generate synthetic
speech. Another approach, proposed by Flanagan et al.
(1975), has an integrated laryngeal model and a vocal tract
analog.  This  provides  for  a physiological  description  of 
the  speech  signal  in  terms  of  vocal  fold  tension,  subglottal 
pressure,  vocal  tract  shape  and  losses  due  to  vocal  tract 
wall  vibration. 

The  articulatory  synthesizers  are  directly  related  to 
the  physiological  parameters  of  speech  production.  They  are 
an  invaluable  tool  for  a systematic  investigation  of  the 
contributions  of  each  physiological  parameter  to  the 
perceptual  attributes  of  the  acoustic  speech  waveform. 
Despite  these  strong  points,  the  articulatory  synthesizers 
have  not  been  as  popular  as  the  formant  synthesizer  or  the 
LP  synthesizer.  The  difficulty  in  obtaining  precise 
articulatory  data  and  heavy  computational  demands  are  their 
major  drawbacks. 

1.4 Approaches to Speech Evaluation

The  performance  of  a speech  synthesizer  is  ultimately 
judged  by  its  ability  to  produce  natural  speech.  Relevant 
to  this  study,  there  are  three  specific  attributes  that 
enter  into  this  judgement,  viz.,  intelligibility,  quality 
and  naturalness.  The  speech  "attribute  space"  is  depicted 
in  Figure  1.5. 

1.4.1  Subjective  Evaluation 

The  prevalent  methods  of  evaluating  synthesizers  are 
largely  subjective.  Intelligibility  is  defined  as  the 
phonetic identifiability of a speech stimulus such as a
syllable,  word,  sentence,  etc.,  which  is  measured  by 
counting  the  number  of  speech  stimulus  units  correctly 
identified  by  a subject  group  (French  and  Steinberg, 
1947).  Quality  is  determined  from  a subjective  appraisal  of 
a speech  stimulus,  using  comparisons  between  a test  stimulus 
and  a reference  utterance  of  known  attributes  such  as 
"breathy,"  "crisp,"  "rough,"  etc.  (Rothauser  et  al.,  1971; 
Colton  and  Estill,  1981).  Several  rating  schemes  and  rank 
order  techniques  have  been  proposed  (Voiers,  1977)  but  their 
drawback  is  the  variability  of  the  scores  with  subject 
groups  and  test  material. 

Speech "naturalness" encompasses the total auditory
impression of speech experienced by a human listener.
Speaker recognizability is suggested as one aspect of
naturalness (Flanagan, 1972). But naturalness can be
perceived without any relation to a specific speaker,
provided the speech simply sounds "human." A stimulus can
sound "natural" without being "intelligible," but the
acceptability of a speech synthesizer demands high
intelligibility as well as naturalness. The lack of
understanding of the perception of the speech signal is
reflected in the absence of even a functional description
of naturalness. Subjective evaluation gives a single rating
in perceptual terms and hence does not directly relate to
the speech signal characteristics (Barnwell, 1979). Its
usefulness in improving a synthesizer is therefore limited.
On the other hand, an objective evaluation procedure is
reliable, invariant with testing conditions and derived
directly from the speech signal, as in Figure 1.6.

Figure 1.5. The speech attribute space.


1.4.2  Objective  Evaluation 

Although the physical correlates of quality and
naturalness are poorly understood, several attempts have
been made to design an objective evaluation procedure
derived from the speech signal (Barnwell, 1980; Makhoul
et al., 1976). A distance metric is formulated based on a
set of signal features. Two speech stimuli are compared by
extracting the same set of features from each and applying a
distance metric to the two feature sets. By selecting a
perceptually consistent (Makhoul et al., 1976) set of
features, one hopes to achieve a high degree of correlation
between objective distance measures and subjective ratings
of the same speech stimuli (Markel and Gray, 1976). Current
research in objective evaluation of speech has focused on
two applications, as in Figure 1.7.

Figure 1.6. Some commonly used objective distance measures:
time segmented SNR, critical band SNR, harmonic ratio,
cepstral distance, log area ratio and log likelihood ratio.
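
The comparison procedure described above (extract the same features from both stimuli, apply a distance metric, then correlate the result with listener ratings) can be sketched as follows. The feature used here, a frame-wise log power spectrum, and the RMS log spectral distance are one illustrative choice and are not claimed to be the measures used in the cited studies. All function names, the frame length and the hop size are assumptions, and the two signals are assumed to be time aligned, of equal length and at least one frame long.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping frames (rows)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def log_spectra_db(x, frame_len=256, hop=128, floor=1e-10):
    """Feature extraction: log power spectrum (dB) of each windowed frame."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(np.maximum(power, floor))

def log_spectral_distance(reference, test, **kwargs):
    """Distance metric: RMS log spectral difference, averaged over frames."""
    ref = log_spectra_db(reference, **kwargs)
    tst = log_spectra_db(test, **kwargs)
    per_frame = np.sqrt(np.mean((ref - tst) ** 2, axis=1))
    return float(np.mean(per_frame))

def correlation_with_listeners(objective_scores, subjective_scores):
    """Pearson correlation between objective distances and listener ratings."""
    return float(np.corrcoef(objective_scores, subjective_scores)[0, 1])
```

A measure of this kind is useful only to the extent that `correlation_with_listeners` returns a value close to +1 or -1 over a representative set of stimuli.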

The  first  study,  conducted  at  Georgia  Institute  of 
Technology (Barnwell, 1980), shown in Figure 1.7a, starts
with  a large  data  base  of  undistorted  natural  speech,  from 
which  a distorted  database  is  generated  by  adding  controlled 
distortions  in  the  form  of  waveform  coders,  filtering, 
clipping,  etc.  A subjective  evaluation  is  then  performed  by 
presenting  the  distorted  and  undistorted  speech  to  listeners 
and  their  preference  scores  are  tabulated.  Several  distance 
measures such as signal-to-noise ratios, spectral distance
measures,  etc.,  are  applied  to  the  two  data  bases  for  each 
type  of  distortion.  Then  the  subjective  and  objective 
scores  are  correlated,  to  determine  how  well  the  objective 
scores  can  predict  subjective  ratings.  This  scheme  cannot 
be  used  to  evaluate  the  loss  of  naturalness  in  a synthesizer 
because  there  is  no  representation  for  the  "noise"  that 
causes  unnaturalness.  The  specific  contributions  of  the 
synthesis  parameters  to  naturalness  are  not  individually 
studied; hence no direct recommendations for improvements in
synthesis schemes are provided. The second approach, taken
by Bolt, Beranek, and Newman, Inc. (Makhoul et al., 1976) and
shown in Figure 1.7b, addresses the evaluation of vocoded
speech. A framework for such an evaluation was developed,
mainly for the linear prediction vocoder. The effects of
encoding and interpolation on speech naturalness were tested
by comparing unquantized filter parameters with encoded,
quantized and interpolated versions of the same
parameters. Thus the implicit assumption was

    Speech synthesized from unquantized
    parameters, extracted every 10 ms, is of very
    good quality compared to the original
    speech. (Makhoul et al., 1976, p. 105).

Our  contention  is  that  speech  synthesized  from  unquantized 
parameters  is  still  not  natural  and  that  subjects  can  easily 
distinguish  between  natural  and  synthetic  speech. 

From  the  above  two  studies  the  following  observations 
can  be  made: 

i) Objective measures such as signal-to-noise ratios
cannot  evaluate  the  degradation  in  naturalness  due 
to  synthesis. 

ii)  Perceptual  correlates  of  naturalness  are  still  not 
clearly  understood. 

iii) Spectral distances perform well in evaluating the
effects of parameter encoding and transmission but
cannot determine how "good" the parameters
themselves are in preserving naturalness.

iv)  A study  of  the  contribution  of  each  model  parameter 
to  naturalness  is  needed,  where  each  parameter  can 
be  estimated  with  a very  high  reliability  to 
determine  the  limits  of  performance. 

1.5 Research Problem

From the previous sections it is clear that a
definition of naturalness is contingent upon a clear
understanding of speech perception, which is not yet
available. From a practical standpoint, one can determine
the parameters of the speech production model that
contribute to naturalness. This "analysis-by-synthesis"
approach constitutes the first step in developing a
functional description of naturalness. In this study, we
have  selected  the  linear  prediction  scheme  to  accomplish  a 
systematic  study  of  the  problems  of  naturalness.  As 
described in Sec. 1.3.2.2, the LP scheme is flexible, easily
computable  and  eminently  suited  for  real  time  applications 
thus  making  it  the  most  popular  synthesis  scheme  at  this 
time. Although synthetic speech of very high
intelligibility can be produced by an LP synthesizer, it
still lacks naturalness. There have been many efforts to
understand and explain these degradations in naturalness
(Wong, 1980; Sambur et al., 1978; Atal and David, 1979;
Makhoul et al., 1976), which are summarized below (see also
Figure 1.8).

a) The all-pole assumption is not an adequate
representation of speech, with nasals and nasalized
vowels being poorly synthesized.

b) Reduction of the excitation signal to a simple
buzz-hiss form fails to reproduce the dynamic
variations of the natural excitation.

c) Fundamental frequency and voicing/unvoicing
decisions are not always accurate. This is, by
far, the most computationally demanding part of the
scheme.

d) Pitch synchronous analysis is necessary for
improving naturalness, but an accurate analysis
frame size cannot be easily determined.

e) Tracking of stop and plosive sounds such as /g/,
/t/, /p/, etc. requires variable size analysis
frames.

f) Highly stylized excitation waveforms result in a
"buzzy" sounding synthesis.

In  all  the  studies  cited  above,  the  model  parameters  such  as 
the  fundamental  frequency  and  voiced/unvoiced  decision  were 
derived from the speech signal itself. The algorithms used
to  compute  the  parameters  are  not  very  reliable  because  of 
noise,  mixed  excitation  and  wide  variability  among 
speakers.  In  addition,  no  attempt  is  made  to  include  the 
dynamics  of  the  vocal  source  into  the  synthesis  schemes. 

This  study  is  aimed  at  improving  the  naturalness  of  LP 
synthesis  by  obtaining  the  source  related  parameters 
mentioned  above,  directly  and  independently  using  the 
electroglottograph (EGG) signal. In addition, pitch
synchronous  analysis/synthesis  is  also  performed  to 
determine  improvements  in  naturalness. 

The  EGG  signal  measures  the  variation  in  tissue 
impedance  across  the  glottis  when  the  two  vocal  folds  are 
vibrating,  producing  a change  in  their  lateral  area  of 
contact.  This  device  provides  a very  good  indication  of  the 
presence  of  glottal  activity  as  well  as  fundamental 
frequency  of  glottal  vibration  (Childers  et  al.,  1982, 
1983a; Fourcin and Abberton, 1971; Fourcin, 1974). This
provides the best opportunity to eliminate the problems
associated with fixed frame analysis and F0 (fundamental
frequency) and voiced/unvoiced determination. With these
considerations in mind, the objectives of this research are
as follows:

1.5.1 Speech production:

a)  Obtain  source  related  model  parameters 
independently  using  the  EGG  waveform  recorded 
simultaneously  with  the  speech  waveform. 

b)  Assess  the  contributions  of  different  excitation 
waveforms  to  the  naturalness  of  LP  synthesis. 

c)  Demonstrate  the  improvement  in  naturalness  with 
better  spectral  tracking  using  pitch  synchronous 
analysis  guided  by  the  EGG  waveform. 

d)  Demonstrate  the  limits  of  speech  naturalness  for 
sentences,  when  the  parameters  have  been  obtained 
in  a very  reliable  fashion. 

1.5.2  Speech  evaluation: 

a) Discuss the reasons for the ineffectiveness of
currently available objective measures for
evaluating speech naturalness.

b) Itemize the important features of the speech signal
necessary for preserving naturalness by a
subjective evaluation procedure.

1.6 Chapter Summaries

Chapter  2 examines  the  issues  in  subjective  and 
objective  evaluation  of  speech  attributes,  such  as  their 
formulation, implementation and limitations. The linear
prediction  scheme  used  in  this  study  is  described  in 
Chapter  3.  Chapter  4 describes  the  Electroglottograph 
signal,  its  characteristics  and  algorithms  for  obtaining  the 
synthesis  parameters  using  the  EGG  signal.  The  various 
excitation models and their effect on naturalness are the
subject  of  Chapter  5.  The  speech  and  EGG  data  used  in  this 
study  are  discussed  in  Chapter  6 and  the  synthesis 
experiments  are  described  in  Chapter  7.  Chapter  8 deals 
with  the  design  and  interpretation  of  listener  evaluation 
ratings,  and  Chapter  9 concludes  the  study  with  a discussion 
and  summary  of  results. 


CHAPTER  2 

ISSUES  IN  SUBJECTIVE  AND  OBJECTIVE 
EVALUATION  OF  SPEECH  SYNTHESIS 

2.1 Introduction

The  acceptability  of  a speech  synthesis  scheme  is 
judged  by  how  closely  the  synthesized  speech  resembles 
natural  speech.  The  attributes  of  speech  that  enter  into 
such  a judgment  are  intelligibility,  quality,  speaker 
recognizability  and  naturalness.  The  relative  importance  of 
each  of  these  attributes  is  determined  by  the  intended 
application of the synthesized speech. For example,
Automatic Voice Response requires high intelligibility but
speaker recognizability is not a major concern. However, voice
communication  using  vocoders  demands  high  intelligibility  in 
addition  to  speaker  recognizability  and  naturalness. 

The  ultimate  judge  of  the  performance  of  a synthesizer 
is,  of  course,  the  human  user.  Consequently,  the  most 
prevalent  method  of  evaluating  synthesizers  has  been 
"subjective".  This  method  presents  synthesized  speech 
segments  to  human  subjects,  whose  response  to  the  specific 
attributes  of  the  synthesized  speech  is  obtained.  But, 
subjective evaluation is expensive, elaborate and prone to
errors due to subject population variability and testing
conditions. Subjective measures yield scores based on
perceptually defined quantities, rather than measurable
speech signal quantities. While subjective evaluation tries
to represent human performance in a systematic manner, it
provides little help in improving a synthesis scheme by
isolating relevant signal features (Barnwell, 1980). Hence,
there  is  an  increasing  interest  in  formulating  "objective" 
evaluation  methods,  which  measure  the  speech  attributes  in 
terms  of  speech  signal  characteristics.  Objective  measures 
are  reliable,  inexpensive  and  enjoy  uniformity  and 
consistency  in  evaluation.  By  relating  perceptual 
attributes  to  signal  characteristics,  one  can  improve  the 
design of a synthesis scheme directly. Since listener
acceptance is the ultimate "yardstick" of synthesis
performance, the validity and usefulness of an objective
measure can be established only if it is well correlated
with subjective evaluation scores. To date, no
single objective measure can evaluate the attributes of a
synthetic speech segment as well as a human listener can.
The reasons for this are many: 1) a lack of understanding of
speech perception, 2) dependence of speech attributes on
language and meaning, 3) limitations of the synthesis model,
etc.

In the following sections, we will discuss the various
issues  involved  in  the  formulation  and  application  of 
subjective  and  objective  measures.  The  bulk  of  this  chapter 
will  be  a review  of  research  in  this  area,  but  we  will 
conclude  the  discussion  with  a critical  assessment  of 
objective  evaluation. 

2.2 Subjective Measures of Intelligibility

Intelligibility tests evaluate the ability of a human
listener to correctly identify a speech stimulus (syllable,
word,  sentence,  etc.),  generated  by  a speech  processing 
system. The listener responds by repeating the stimulus,
writing it down or selecting it from a set of alternative
stimuli (Voiers, 1977b).

The  systematic  study  of  speech  intelligibility  was 
inspired  mainly  by  the  need  for  evaluating  telephone 
performance.  Consequently,  much  of  the  emphasis  was  on  the 
effects of signal-to-noise ratio and transmission bandwidth
on speech intelligibility (Fletcher and Steinberg, 1929).
In addition, degradation in intelligibility due to masking
(Stevens et al., 1946), clipping and filtering (Licklider
et al., 1948) and speaking and listening conditions was
investigated in detail.

These results, however, are applicable only to systems
that  distort  natural  speech  due  to  noise,  masking,  bandwidth 
restriction,  etc.,  and  not  to  speech  analysis-synthesis 
schemes.  In  the  case  of  the  latter,  the  degradation  in 
intelligibility  is  mainly  due  to  deficiencies  of  the  model 
itself. 

The  study  of  intelligibility  has  tended  to  focus  on  the 
relative  contributions  of  the  analysis  and  synthesis 
parameters  to  the  intelligibility  of  synthesized  speech. 
Wong  and  Markel  (1977)  evaluated  the  intelligibility  of  a 
linear  prediction  vocoder,  when  the  predictor  order,  frame 
rate  and  frame  size  were  manipulated  systematically.  The 
evaluation  was  done  subjectively  using  the  Diagnostic  Rhyme 
Test,  which  will  be  described  later  in  this  chapter.  Voiced 
speech segments had consistently high intelligibility.
Updating  the  predictor  coefficients  faster  (from  22.5  msec 
to  11.25  msec)  for  unvoiced  segments  improved  the 
intelligibility,  but  reducing  the  number  of  predictor 
coefficients  from  10  to  4 for  the  same  unvoiced  segments 
removed  this  improvement.  Kahn  and  Garst  (1983)  found 
higher  intelligibility  for  males  than  for  females,  in  their 
LP  synthesizer.  Presence  of  nasality  and  whisper  degraded 
intelligibility  for  both  males  and  females.  Agrawal  and  Lin 
(1975) found that in a formant synthesizer, suppressing the
magnitude of the second formant degraded intelligibility
considerably, whereas suppressing the first formant produced
little degradation. This again supports the finding of
Thomas (1968), who attributed the high intelligibility of
clipped speech to the presence of a strong second formant.

2.2.1  Articulation  Index 

One  of  the  earliest  measures  of  intelligibility  is  the 
Articulation  Index  (AI)  (French  and  Steinberg,  1947).  It  is 
applicable to situations where signal-to-noise ratio is the
main indicator of intelligibility. The Articulation Index
is defined as a number between zero and one, which is
expressed as a sum of increments $\Delta A$ corresponding to
frequency increments $\Delta f$ of the total speech frequency
range. Each increment $\Delta A$ is only a fraction $W$ of the
maximum $\Delta A_m$ assigned to each band, i.e.,
$\Delta A = W \cdot \Delta A_m$. In one implementation, the
frequency range is divided into 20 bands, so that
$\Delta A_m = 0.05$ (Kryter, 1962). The value of $W$
is related to the levels of speech and noise, and it is
obtained  from  the  results  of  listener  tests  for  various 
levels  of  speech  and  interfering  noise  (French  and 
Steinberg,  1947).  The  AI  measures  the  degradation  in 
intelligibility  due  to  noise,  which  can  be  measured 
independent  of  the  speech  signal.  Hence,  the  AI  is  not 
applicable  to  the  synthesis  situation  where  "modeling"  noise 
has  no  representation  and  cannot  be  measured  independent  of 
the  speech  signal. 
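
The summation that defines the AI can be sketched in a few lines. This is only an illustration: the function name and the W values below are hypothetical, and in practice each W would be read from the listener-test data of French and Steinberg (1947), not chosen by hand.

```python
def articulation_index(band_weights, max_increment=None):
    """Sum the per-band contributions W * (maximum increment).

    band_weights  : one fraction W in [0, 1] per frequency band
    max_increment : maximum contribution of one band; defaults to 1/N,
                    which gives 0.05 for the 20-band case in the text
    """
    n_bands = len(band_weights)
    if n_bands == 0:
        raise ValueError("at least one band is required")
    if max_increment is None:
        max_increment = 1.0 / n_bands
    for w in band_weights:
        if not 0.0 <= w <= 1.0:
            raise ValueError("each W must lie between 0 and 1")
    return sum(w * max_increment for w in band_weights)


# Hypothetical example: ten bands fully effective, ten bands half masked.
weights = [1.0] * 10 + [0.5] * 10
ai = articulation_index(weights)
print(round(ai, 3))
```

With all W equal to one the index reaches its maximum of one, and with all W equal to zero it is zero, in agreement with the definition above.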

2.2.2 Standardized Test of Intelligibility

Diagnostic  Rhyme  Test.  When  a speech  stimulus  is 
presented  to  a listener,  he  makes  use  of  the  contextual 
information,  such  as  language  structure,  meaning  and 
dialect,  in  addition  to  the  acoustic  structure  to  properly 
identify  the  stimulus.  Thus,  controlling  the  context  of  the 
test  stimulus  is  a central  problem  in  intelligibility 
evaluation.  Miller  et  al.  (1951)  found  that  intelligibility 
scores  for  different  sets  of  test  materials  are  at  variance 
mainly  because  some  test  sets  require  more  contextual 
information  than  others.  They  stressed  the  need  for 
standardized  procedures  and  vocabulary  to  evaluate 
intelligibility. 


The Diagnostic Rhyme Test (DRT) (Voiers, 1983) was designed
to precisely control the context of the test stimulus. It is
a two-choice test in which each item contains a pair of
rhyming words, whose initial consonants differ in a single
distinctive feature. The listener is expected to identify
this feature when each pair is presented. The DRT tests the
intelligibility of consonants only, since they contain the
bulk of information in the English language. The consonants
are tested only in their initial position in each word. The
test is based largely on the "Distinctive Features" theory of
Jakobson, Fant and Halle (1952) and Fant (1967). It tests the
ability of the synthesizer to reproduce the features
accurately. The specific features employed are voicing,
nasality, sustention, sibilation, graveness and compactness.
Each feature is represented by sixteen pairs of rhyming
words. For example, voicing is represented by the pair
"veal-feel", in which only the first consonant of each word
differs in a single distinctive feature, viz., voicing. The
entire list is found in Voiers (1977b). The final DRT score
for a particular speech processing scheme is obtained as

$$ S = \frac{100\,(R - W)}{T} $$

where

S = "true" percent-correct responses
R = number of correct responses
W = number of incorrect responses
T = number of items presented.


The  DRT  has  been  used  by  a number  of  researchers  to  test  the 
intelligibility  of  vocoder  speech  (Wong  and  Markel,  1977; 
Smith et al., 1981). It was found to be reliable with
different  speakers,  although  the  reliability  was  lower 
across  the  listener  class,  suggesting  the  need  for  a large 
and  varied  listener  population. 
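
As a small worked example of the DRT scoring rule, the sketch below evaluates S = 100(R - W)/T. The function name and the response counts are hypothetical. The total of 96 items follows from the six features with sixteen word pairs each described above.

```python
def drt_score(correct, incorrect, presented):
    """Return the DRT score S = 100 (R - W) / T, in percent.

    Subtracting the incorrect responses corrects for guessing in a
    two-choice test: a listener who guesses at random gets R close to W,
    so the score is close to zero rather than 50.
    """
    if presented <= 0:
        raise ValueError("the number of items presented must be positive")
    if correct < 0 or incorrect < 0 or correct + incorrect > presented:
        raise ValueError("inconsistent response counts")
    return 100.0 * (correct - incorrect) / presented


# Hypothetical listener: 90 correct and 6 incorrect out of 96 items.
score = drt_score(90, 6, 96)
print(score)  # 87.5
```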

2.2.3 Difference Limens

Another  subjective  measure  of  speech  which  reflects  the 
ability  of  our  perceptual  mechanism  to  detect  changes  in 
some  specific  attribute  of  the  speech  signal  is  the 
Difference Limen ("limen" means threshold), also referred to
as the Just Noticeable Difference (JND) (Flanagan, 1955). It is
the  smallest  change  in  a specific  characteristic  of  the 
speech  signal  that  produces  a perceptible  difference.  The 
Difference  Limen  (DL)  is  measured  by  presenting  to  a 
listener  group,  a pair  of  synthetic  speech  sounds  that 
differ  from  each  other  in  intensity,  formant  amplitude,  etc. 
by  a controlled  amount.  The  Difference  Limen  then 
corresponds  to  the  smallest  value  of  the  test  attribute 
which  produces  a perceptible  difference  in  50%  of  the 
listener  group  (Flanagan,  1955).  The  Difference  Limens  for 
a) first and second formant frequencies, and their
respective magnitudes and bandwidths, b) fundamental
frequency and c) intensity of a vowel sound have been
measured by Flanagan (1972). The difference limen does not
directly evaluate speech intelligibility, but merely
indicates the smallest amount of change in a specific
characteristic of the speech signal that can be perceptually
detected.

2.3 Subjective Measures of Speech Quality

The  concept  of  speech  quality  is  rooted  in  the  total 
auditory  impression  of  a listener.  There  is  no  generally 
accepted  definition  of  speech  quality  and  the  term  quality 
itself  has  been  used  in  different  contexts. 

A phonetician  might  use  "Quality"  in  the  context  of 
articulatory  difference,  for  example:  the  vowels  in  "hot" 
and  "cot"  differ  in  their  "vowel  quality."  But,  a singer 
might  use  quality  in  terms  of  voice  modes  or  registers  which 
are  related  to  the  laryngeal  vibration.  Further,  quality 
might  be  defined  in  such  terms  as  breathy,  hoarse,  crisp, 
smooth,  etc.  For  these  reasons,  a quantitative  definition 
and  evaluation  of  speech  quality  has  not  been  possible. 
However,  several  attempts  have  been  made  to  assess  speech 
quality  in  terms  of  loudness,  comparison  with  reference 
signals  of  preassigned  quality  and  speaker  recognizability 
(Rothauser  et  al.,  1968,  1971).  The  IEEE  Recommended 
Practice  for  Speech  Quality  Measurement  (1969)  outlines  a 
general  framework  for  quality  measurements,  detailing  the 
criteria  for  selection  of  speech  material,  noise  level, 
listening  group,  scoring  strategy,  etc.  The  evaluation  of 
quality  was  done  by  obtaining  subjective  ratings  and  no 
objective  measures  were  proposed.  Preference  measurements 
based  on  the  percentage  of  the  listening  group  that  prefers 
the  test  speech  signal  to  the  reference  speech  signal  as  a 
source of information were described in detail. The
degradation  in  quality  due  to  background  noise,  filtering, 
transmission  distortion,  etc.  was  the  object  of  the 
evaluation  procedure.  The  degradation  in  quality  due  to 
speech  modeling  errors  was  not  dealt  with. 

Multidimensional scaling (Shepard et al., 1972; Singh,
1978)  is  a useful  technique  for  determining  the  relative 
importance  placed  by  listeners  on  different  attributes  of 
the  speech  stimulus.  This  method  assumes  that  subjects  use 
a common  set  of  factors  or  dimensions  to  perceive  speech 
quality.  The  scaling  algorithm  determines  from  subjective 
ratings,  the  smallest  number  of  dimensions  needed  to  account 
for  a specified  amount  of  variance  in  these  ratings 
(Flanagan,  1972;  Colton  and  Estill,  1981).  The  labeling  of 
these  dimensions  in  terms  of  signal  characteristics  is  done 
subjectively, from a knowledge of the speech stimuli
presented.

Diagnostic Acceptability Measure (DAM). This is a
standardized test for measuring the quality of a speech
segment, where quality is subjectively defined as overall
"acceptability" (Voiers, 1977a). The DAM provides a
separate score for the speech signal, the background and the
total effect. The speech descriptors are "thin," "rasping,"
etc., and the background descriptors are
"hissing," "rumbling," and "buzzing". The speech segments
are twelve phonemically controlled, six-syllable
sentences. The DAM has been employed to obtain subjective
ratings of acceptance of speech which has undergone various
types of coding (Barnwell, 1980; Barnwell et al., 1982).


2.4 Subjective Measures of Naturalness

"Naturalness" of a speech segment is again a highly
subjective attribute, which in this study is used in the
sense of "human sounding". The method of naturalness
evaluation has been wholly subjective. Naturalness
ratings are obtained either on an arbitrary scale of 0 to 100
(Carlson et al., 1979), when the speech segment is presented
alone, or as a binary decision of "natural" or "unnatural",
when it is presented in a pair. When evaluating naturalness,
the intelligibility of the speech segment should be high. We
will  present  the  naturalness  ratings  for  LP  synthesized 
speech  and  discuss  the  results  in  Chapter  8. 

The  subjective  evaluation  methods  described  so  far 
measure  human  perceptual  attributes  using  human  subjects. 
While  they  directly  provide  ratings  of  abstract  quantities, 
they  do  not  individually  consider  specific  signal 
characteristics.  Hence  subjective  ratings  provide  little 
direct  help  in  improving  a synthesis  scheme.  But,  an 
objective  measure  based  on  specific  signal  characteristics 
could rate the individual contributions of these
characteristics  and  thereby  suggest  improvements  in  the 
synthesis  design.  Human  speech  perception  depends  on 
context,  semantic  content,  speaker  traits,  etc.  in  addition 
to  the  speech  signal  features.  Clearly,  no  single  objective 
measure  can  account  for  all  these  factors  to  provide  a 
rating  for  speech  quality  or  naturalness.  But  efforts  are 
being  made  to  formulate  objective  distance  measures  which 
can  hopefully  evaluate  at  least  some  aspects  of  speech 
perception. In the following sections we will briefly
summarize  the  results  of  such  efforts  and  discuss  their 
limitations  in  evaluating  speech  naturalness. 


2.5 Objective Distance Measures for Speech Evaluation

2.5.1  Properties  of  a Distance  Measure 

An objective distance measure for speech signals is a
nonnegative number which indicates a suitably defined
similarity or difference between two speech data segments.
Any distance measure $d(x,y)$ between two speech data
frames $x$ and $y$ should at least satisfy the following
conditions (Gray and Markel, 1976):

$$
\begin{aligned}
d(x,y) &= d(y,x) && \text{Symmetry} \\
d(x,y) &> 0 \ \text{for } x \neq y, \quad d(x,y) = 0 \ \text{for } x = y && \text{Positive definiteness} \\
d(x,y) &\le d(x,z) + d(y,z) && \text{Triangularity}
\end{aligned}
$$

Any distance measure that tries to evaluate speech
quality  or  naturalness  based  on  the  speech  signal  should 
emulate auditory processing. Since the human ear is
believed to be very sensitive to the short-term spectrum of
the speech signal, spectrally based distances have received
wide attention. As emphasized earlier, signal-to-noise
ratios are not appropriate for measuring naturalness, since
synthetic speech can sound very unnatural even when the
signal-to-noise ratio is high. The source of unnaturalness
is in the speech signal itself.

The  currently  available  objective  distance  measures 
have  been  based  on  the  linear  prediction  theory,  which  is 
detailed  in  Chapter  3.  We  will  briefly  describe  the  most 
popular  of  these  distances  below.  These  measures  compare 
two  speech  segments  in  terms  of  their  respective  spectra. 
The  spectra  are  derived  from  the  corresponding  sets  of 
linear  predictive  coefficients. 


2.5.2  Currently  Available  Distance  Measures 

RMS log spectral measure. Let the two speech segments
under comparison be s(n) and s'(n). Using a linear
predictive model, we can represent the signal spectrum by an
all-pole model spectrum. Let the corresponding spectra be
$G/A(z)$ and $G'/A'(z)$, where $A(z)$ and $A'(z)$ are the
inverse filter polynomials of s(n) and s'(n) respectively;
$G$ and $G'$ are the corresponding gain terms. The difference
between these two spectra on a log magnitude scale is

$$ V(\theta) = \ln\left[\frac{G^2}{|A(e^{j\theta})|^2}\right] - \ln\left[\frac{G'^2}{|A'(e^{j\theta})|^2}\right] \qquad (2.1) $$

where

$\theta$ = angle in the z-plane
$\theta = \pi$ corresponds to half the sampling frequency.

Then, the log spectral measure is defined by $d_p$, where

$$ (d_p)^p = \frac{1}{2\pi} \int_{-\pi}^{\pi} |V(\theta)|^p \, d\theta \qquad (2.2) $$

p = 1 for absolute log spectral measure
p = 2 for rms log spectral measure


The  computation  of  this  measure  requires  two  FFT's  and  two 
logarithms  to  compute  (2.2)  as  a summation.  But  Gray  and 
Markel  (1976)  have  shown  that  the  Cepstral  distance  measure, 
defined  below,  is  an  equivalent  representation  that  can  be 
computed  efficiently. 
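
To make the definition concrete, the sketch below approximates (2.2) by a sum over a uniform grid of angles. It evaluates each inverse filter directly instead of using an FFT, which is slower but keeps the example short. The function names, the grid size and the coefficient values are assumptions made for illustration only.

```python
import cmath
import math


def log_model_spectrum(a, gain, theta):
    """ln(G^2 / |A(e^{j theta})|^2) for A(z) = 1 + sum_k a_k z^-k.

    a is the list [a_1, ..., a_p].
    """
    A = 1.0 + sum(a_k * cmath.exp(-1j * theta * k)
                  for k, a_k in enumerate(a, start=1))
    return math.log(gain * gain / abs(A) ** 2)


def log_spectral_distance(a1, g1, a2, g2, p=2, n_points=512):
    """Approximate d_p of (2.2) by averaging |V(theta)|^p over a grid."""
    total = 0.0
    for i in range(n_points):
        theta = -math.pi + 2.0 * math.pi * i / n_points
        v = log_model_spectrum(a1, g1, theta) - log_model_spectrum(a2, g2, theta)
        total += abs(v) ** p
    return (total / n_points) ** (1.0 / p)


# Hypothetical first-order filters with unit gain.
d_rms = log_spectral_distance([-0.9], 1.0, [-0.5], 1.0, p=2)
d_abs = log_spectral_distance([-0.9], 1.0, [-0.5], 1.0, p=1)
print(d_rms, d_abs)
```

The result is in natural-log units. If the two filters are identical and only the gains differ, V is constant and the distance reduces to the magnitude of the log gain difference.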

Cepstral distance measure. If the roots of $A(z)$ lie
inside the unit circle, using a Taylor series expansion we
can write

$$ \ln[A(z)] = -\sum_{k=1}^{\infty} C_k z^{-k} \qquad (2.3) $$

where $C_k$ are the Cepstral coefficients (Oppenheim and
Schafer, 1968; Childers, 1977). Therefore,

$$ \ln\left[\frac{G^2}{|A(e^{j\theta})|^2}\right] = \sum_{k=-\infty}^{\infty} C_k e^{-jk\theta} \qquad (2.4) $$

where

$$ C_0 = \ln[G^2], \qquad C_{-k} = C_k \qquad (2.5) $$

Similarly,

$$ \ln\left[\frac{G'^2}{|A'(e^{j\theta})|^2}\right] = \sum_{k=-\infty}^{\infty} C'_k e^{-jk\theta} \qquad (2.6) $$

Then the difference between the two spectra can be
represented by the difference between the two sets of
Cepstral coefficients $C_k$ and $C'_k$ as

$$ d_2^2 = (C_0 - C'_0)^2 + 2 \sum_{k=1}^{\infty} (C_k - C'_k)^2 \qquad (2.7) $$

For computational purposes, the series can be truncated to L
terms in (2.7) to give

$$ [U(L)]^2 = (C_0 - C'_0)^2 + 2 \sum_{k=1}^{L} (C_k - C'_k)^2 \qquad (2.8) $$

We can treat $U(L)$ as the rms spectral distance between the
spectra of s(n) and s'(n), where each log spectrum is
cepstrally smoothed to L coefficients.
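
The truncated distance can be computed directly from the predictor coefficients. The sketch below assumes the convention A(z) = 1 + sum a_k z^-k and obtains the Cepstral coefficients with the standard recursion that follows from differentiating ln[A(z)]. The function names and the default of 30 terms are illustrative choices, not values taken from this study.

```python
import math


def lpc_to_cepstrum(a, gain, n_terms):
    """Return [C_0, C_1, ..., C_L] for the model spectrum G^2 / |A|^2.

    a is [a_1, ..., a_p] with A(z) = 1 + sum_k a_k z^-k, all roots
    assumed inside the unit circle.
    """
    p = len(a)
    c = [0.0] * (n_terms + 1)
    c[0] = math.log(gain * gain)
    for n in range(1, n_terms + 1):
        a_n = a[n - 1] if n <= p else 0.0
        acc = 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += k * c[k] * a[n - k - 1]
        c[n] = -a_n - acc / n
    return c


def cepstral_distance(a1, g1, a2, g2, n_terms=30):
    """Truncated rms log spectral distance U(L), in natural-log units."""
    c1 = lpc_to_cepstrum(a1, g1, n_terms)
    c2 = lpc_to_cepstrum(a2, g2, n_terms)
    tail = sum((x - y) ** 2 for x, y in zip(c1[1:], c2[1:]))
    return math.sqrt((c1[0] - c2[0]) ** 2 + 2.0 * tail)


# Hypothetical first-order filters with unit gain.
u = cepstral_distance([-0.9], 1.0, [-0.5], 1.0)
print(u)
```

For a first-order filter the recursion reproduces the series of -ln(1 + a z^-1) term by term, which gives a quick check of the sign convention.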

Log likelihood ratio. The criterion for obtaining the
inverse filter polynomial $A(z)$ from the speech data s(n) is
the minimization of the squared error between the actual
speech sample s(n) and the linearly predicted speech sample
$\hat{s}(n)$. Let us denote the prediction error by e(n) and
the total squared prediction error, also called the residual
energy, by $\alpha$, where

$$ e(n) = s(n) - \hat{s}(n) \qquad (2.9a) $$

$$ \alpha = \sum_n [e(n)]^2 \qquad (2.9b) $$

Thus, $\alpha$ can be treated as the energy of the output of
the inverse filter $A(z)$, when the input is the speech signal
s(n), as in Figure 2.1.a. Let us now obtain the inverse filter
$A'(z)$ corresponding to s'(n) using the same criterion as
above. Obviously, when s'(n) is passed through $A'(z)$, the
residual energy will be minimum. If s(n) is passed through
$A'(z)$ as in Figure 2.1.b, the resulting residual energy
$\delta$ will be larger than $\alpha$, since $A'(z)$ was derived
for s'(n). The ratio $\delta/\alpha$ now represents the
"mismatch" between $A(z)$ and $A'(z)$ or, equivalently, between
the spectra of s(n) and s'(n) respectively (Gray and Markel,
1976).

Figure 2.1. Inverse filter combination for obtaining
the log likelihood ratio.

The ratio $\delta/\alpha$, also called the likelihood ratio, is
always greater than 1 when $A(z)$ and $A'(z)$ are not
identical. If they are identical, then $\delta/\alpha = 1$. The
logarithm of $\delta/\alpha$, referred to as the log likelihood
ratio, can be related to the Itakura distance (Itakura, 1975;
Tribolet et al., 1979).
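
The two residual energies can be obtained by passing the same frame through both inverse filters. The sketch below does this by direct filtering, with the frame taken as zero outside its limits. To keep it short, the inverse filters are first-order predictors with a closed-form solution; the function names and the test signals are hypothetical.

```python
import math


def residual_energy(signal, a):
    """Energy of e(n) = s(n) + sum_k a_k s(n-k), with s(n) = 0 outside the frame."""
    n_samples = len(signal)
    energy = 0.0
    for n in range(n_samples + len(a)):
        e = signal[n] if n < n_samples else 0.0
        for k, a_k in enumerate(a, start=1):
            if 0 <= n - k < n_samples:
                e += a_k * signal[n - k]
        energy += e * e
    return energy


def first_order_inverse_filter(signal):
    """Least-squares first-order predictor: a_1 = -r(1) / r(0)."""
    r0 = sum(x * x for x in signal)
    r1 = sum(signal[i] * signal[i + 1] for i in range(len(signal) - 1))
    return [-r1 / r0]


def log_likelihood_ratio(signal, a_own, a_other):
    """ln(delta / alpha) for a frame, its own filter and a mismatched filter."""
    alpha = residual_energy(signal, a_own)
    delta = residual_energy(signal, a_other)
    return math.log(delta / alpha)


# Hypothetical frames: a slowly varying and a rapidly varying sinusoid.
s = [math.sin(0.3 * n) for n in range(40)]
s_prime = [math.sin(2.5 * n) for n in range(40)]
a = first_order_inverse_filter(s)
a_prime = first_order_inverse_filter(s_prime)
llr = log_likelihood_ratio(s, a, a_prime)
print(llr)
```

Because the filter derived from s(n) minimizes the residual energy of s(n), the ratio cannot fall below one and its logarithm cannot be negative.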


2.6 Application of Distance Measures to Speech Evaluation

The  main  strength  of  the  objective  measures  discussed 
in  the  previous  section  lies  in  their  ability  to  compare  the 
spectra  of  two  speech  segments.  Consequently,  these 
distance  measures  have  been  found  to  be  very  effective  in 
vowel  recognition  (Rabiner  and  Levinson,  1981). 
Figure 2.2.b illustrates the use of the Cepstral distance
and the log likelihood measure in comparing the vowels /a/
and /i/ for subject JMN. Figure 2.2.a compares the spectra
of the same vowel /a/ for two subjects (JMN, JNL). The
difference between the spectra of /a/ and /i/ is large, which
results in a distance measure of 3.84 dB (Cepstral distance)
and 1.49 dB (log likelihood measure). But the difference
between the spectra for the same vowel produced by two
different speakers is 1.53 dB (Cepstral distance) and 1.2 dB
(log likelihood measure). These two distance measures
compute the difference between vowel spectra, which is
consistent with our perceptual ability to phonetically
identify the vowels (Gray and Markel, 1976). So, the
distance measure increases with the spectral
difference. But all three vowels used here sound equally
"natural". The distance measure should have remained the
same in each case, if it indeed measured "naturalness" and
not simply the spectral difference.

Figure 2.2. Application of two objective distance
measures to vowel spectra. a) Spectra of vowel /a/ for two
speakers. b) Spectra of vowels /a/ and /i/ for the same
speaker. (Horizontal axis: frequency in Hz.)

Recently, these measures have been used to evaluate
degradations in speech quality due to distortions introduced
by quantization, coding, transmission channel, etc. As
discussed in Chapter 1, the distance measures were used to
evaluate quality degradations in two situations.


a) Controlled distortions as a result of coding, filtering
and noise were added to natural speech. Linear predictive
coefficients were computed for both the
original and distorted speech segments after proper time
synchronization. The distance measures described
earlier were then applied to the two coefficient sets.
By classifying speech into different phonetic classes, a
composite measure was obtained. The measure was well
correlated with subjective scores obtained via the DAM
test described in Section 2.3 (Barnwell, 1979, 1980;
Barnwell and Quackenbush, 1982).

b) The distance measures were applied to two sets of
coefficients, viz., the original, unquantized set and the
quantized, transmitted set (Makhoul et al., 1976;
Viswanathan et al., 1978).

But  the  above  studies  have  made  the  implicit 
assumption  that  the  analysis  coefficients  produce 
synthesized speech which is essentially equivalent to
natural  speech.  Spectral  similarity  is  computed 
relative  to  coefficient  sets  and  not  original  speech. 
Here  lies  the  reason  why  the  objective  measures  cannot 
"compute"  the  loss  of  naturalness  due  to  the  synthesis 
scheme.  So,  we  must  first  tackle  the  issue,  "what  is 
the  best  analysis-synthesis  procedure  that  minimizes  the 
degradation  due  to  the  model  itself?"  Can  the  "best" 
analysis  procedure  result  in  the  "best"  synthesis,  i.e., 
synthesis  closest  to  natural  speech? 

The  major  part  of  this  study  is  now  concerned  with 
designing  an  analysis  scheme  that  obtains  all  the 
analysis  parameters  very  reliably  and  a synthesis  scheme 
that  uses  these  parameters  along  with  excitation 
functions  that  are  more  realistic  than  the  ones  used 
currently.  We  attempt  to  establish  limits  of 
naturalness  using  subjective  tests,  when  the  analysis- 
synthesis  scheme  is  designed  in  a very  reliable  manner. 


CHAPTER  3 

THE  LINEAR  PREDICTION  MODEL  FOR  SPEECH  ANALYSIS-SYNTHESIS 

3.1 Introduction

This  chapter  provides  a description  of  the  linear 
prediction  theory  as  applied  to  speech  processing.  The 
various  parameters  of  the  LP  model  have  a direct  impact  on 
the  "naturalness"  of  the  synthesized  speech.  Hence  we  will 
explore  in  some  detail  what  these  parameters  mean,  how  they 
are  obtained  and  what  improvements  can  be  made  in  their 
computation.

During  the  past  two  decades,  several  formulations  of 
the  basic  idea  of  linear  prediction  have  been  proposed.  The 
maximum  likelihood  formulation  of  Itakura  and  Saito  (1970) 
was  the  earliest  of  these.  The  term  "Linear  Prediction"  was 
first applied to speech processing by Atal and Schroeder
(1970) and Atal and Hanauer (1971), in what has come to be
known  as  the  covariance  method  of  linear  prediction.  Since 
then  several  different  formulations  have  been  proposed,  each 
based  on  different  assumptions  but  leading  to  similar 
results.  Each  formulation  has  provided  a better  insight 
into  the  speech  modeling  problem,  while  computational 
demands  have  generally  dictated  their  choice.  The  text  by 
Markel and Gray (1976) and the tutorial by Makhoul (1975)
provide  a comprehensive  treatment  of  the  various  linear 
prediction  schemes. 


3.2 Description of the Model

The strength of the linear prediction model lies in its
close resemblance to a physical model of speech which treats
the speech signal as the output of a linear, time-varying
system excited by either a quasi-periodic waveform during
voiced speech or by a random noise signal during unvoiced
speech, as in Figure 3.1. The parameters of this model can
be obtained efficiently by the linear prediction analysis
techniques.

In the above model, the time-varying digital filter has
the transfer function given by

$$ H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}} \qquad (3.1) $$

This can be written alternatively in the time domain as

$$ s(n) = -\sum_{k=1}^{p} a_k s(n-k) + G\,u(n) \qquad (3.2) $$

Let us assume that given a time series of speech samples,
the current sample can be estimated as a linear combination
of p past samples. Mathematically,

$$ \hat{s}(n) = -\sum_{k=1}^{p} a_k s(n-k) \qquad (3.3) $$

The error between the value of the actual sample and its
estimate is

$$ e(n) = s(n) - \hat{s}(n) = s(n) + \sum_{k=1}^{p} a_k s(n-k) \qquad (3.4) $$

or, equivalently,

$$ s(n) = -\sum_{k=1}^{p} a_k s(n-k) + e(n) \qquad (3.5) $$

If the linear prediction model of Equation (3.5) conforms to
the basic speech production model given by (3.2), then

$$ e(n) = G\,u(n) \qquad (3.6) $$

Thus the coefficients $\{a_k\}$ identify the system, whose output
is s(n). The problem then is to determine the values of the
coefficients $\{a_k\}$ from the actual speech signal.
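
The relation between the synthesis recursion (3.2) and the inverse filter (3.4) can be checked numerically. In the sketch below a signal is generated from a known excitation and then inverse filtered with the same coefficients, which should return G u(n) as stated in (3.6). The coefficient values, the gain and the impulse-train excitation are hypothetical.

```python
def synthesize(a, gain, excitation):
    """s(n) = -sum_k a_k s(n-k) + G u(n), starting from zero initial conditions."""
    s = []
    for n, u in enumerate(excitation):
        value = gain * u
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                value -= a_k * s[n - k]
        s.append(value)
    return s


def inverse_filter(a, signal):
    """e(n) = s(n) + sum_k a_k s(n-k), with samples before the start taken as zero."""
    e = []
    for n, x in enumerate(signal):
        value = x
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                value += a_k * signal[n - k]
        e.append(value)
    return e


# Hypothetical second-order filter (poles inside the unit circle),
# driven by an impulse train with a period of 8 samples.
a = [-1.2, 0.8]
gain = 2.0
excitation = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]

speech = synthesize(a, gain, excitation)
residual = inverse_filter(a, speech)
max_error = max(abs(r - gain * u) for r, u in zip(residual, excitation))
print(max_error)
```

With real speech the coefficients are estimated and not known in advance, so the residual only approximates the true excitation.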

3.2.1  Time  Domain  Approach 

The criterion used to obtain the coefficients $\{a_k\}$ is
the minimization of the squared prediction error $E$ with
respect to each coefficient $a_k$, over some time interval,
where

$$ E = \sum_n [e(n)]^2 \qquad (3.8) $$

This leads to the following set of "normal" equations:

$$ \sum_{k=1}^{p} a_k \sum_n s(n-k)\,s(n-i) = -\sum_n s(n)\,s(n-i), \qquad 1 \le i \le p \qquad (3.9) $$

For a short-time analysis, the limits of summation are
finite. The particular choice of these limits has led to
two methods of analysis, viz., the autocorrelation method
(Markel and Gray, 1976) and the covariance method (Atal and
Hanauer, 1971).

The  autocorrelation  method  results  in  a filter 
structure  that  is  guaranteed  to  be  stable.  At  the  same 
time,  it  operates  on  a data  segment  that  is  windowed  using  a 
Hanning or a Hamming window, typically 20-30 msec long (two
to  three  pitch  periods). 

The  covariance  method,  on  the  other  hand,  gives  a 
filter  with  no  guaranteed  stability,  but  requires  no 
explicit  windowing.  Hence  it  is  eminently  suited  for  pitch 
synchronous  analysis. 

There are fast computational algorithms for each method
(Levinson, 1947; Durbin, 1960; Atal and Hanauer, 1971); the
choice between the two methods is determined by analysis
frame size, stability and computational demand.
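
As an illustration of the autocorrelation method, the sketch below windows a frame, computes its autocorrelation and solves the normal equations with the Levinson-Durbin recursion, using the sign convention A(z) = 1 + sum a_k z^-k. The test frame, its length and the predictor order are hypothetical, and no attempt is made to reproduce the analysis conditions of this study.

```python
import math


def hamming(n_samples):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def autocorrelation(frame, max_lag):
    """r(lag) = sum_n frame(n) frame(n + lag) for lag = 0..max_lag."""
    return [sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_k a_k r(|i-k|) = -r(i), i = 1..order.

    Returns ([a_1, ..., a_p], residual energy).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        reflection = -acc / err
        updated = a[:]
        for k in range(1, i):
            updated[k] = a[k] + reflection * a[i - k]
        updated[i] = reflection
        a = updated
        err *= 1.0 - reflection * reflection
    return a[1:], err


# Hypothetical frame: two sinusoids plus a small deterministic disturbance.
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n)
         + 0.05 * (((n * 7919) % 13) - 6) for n in range(200)]
windowed = [w * x for w, x in zip(hamming(len(frame)), frame)]
r = autocorrelation(windowed, 4)
coeffs, err = levinson_durbin(r, 4)
print(coeffs, err)
```

The recursion needs on the order of p squared operations, against p cubed for a general linear solver, which is the practical attraction of the autocorrelation formulation.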


3.2.2  Frequency  Domain  Approach 

The  linear  prediction  method  can  also  be  viewed  as  a 
technique  of  matching  the  spectrum  of  a given  signal  by  an 
all-pole  model  spectrum.  This  idea  is  used  in  the 
formulation  of  spectral  distance  measures  based  on  the  LP 
coefficients,  which  compute  the  distance  between  two  speech 
spectra  (Makhoul,  1975). 

The LP method approximates a given signal spectrum $P(\omega)$
by an all-pole spectrum $\hat{P}(\omega)$, such that the error

$$ E = \frac{G^2}{2\pi} \int_{-\pi}^{\pi} \frac{P(\omega)}{\hat{P}(\omega)} \, d\omega \qquad (3.10) $$

is minimized.

The error is larger when $P(\omega)$ is greater than $\hat{P}(\omega)$;
hence the minimization produces a better spectral
approximation at those regions where $P(\omega) > \hat{P}(\omega)$ than
where $P(\omega) < \hat{P}(\omega)$. This results in a model
spectrum $\hat{P}(\omega)$ that is a good estimate of the "envelope" of
the signal spectrum. Further, the spectral peaks of $P(\omega)$
are weighted most heavily in the error criterion and
hence $\hat{P}(\omega)$ gives the best model fit at these peaks (Makhoul,
1975).

In a practical implementation, there are many
considerations  in  the  selection  of  the  model  parameters  and 
analysis  conditions,  dictated  by  the  application  at  hand. 
In  this  study,  our  objective  is  to  synthesize  "natural 
sounding"  speech.  Hence,  the  analysis  conditions  are  chosen 
according  to  their  contributions  to  the  "naturalness"  of  LP 
synthesized  speech. 


3.3 Analysis Conditions
3.3.1 Choice of Method

From the synthesis viewpoint, the choice between the
autocorrelation and the covariance method is determined by
stability, accurate modeling of rapid changes in the speech
spectrum and speed of computation.

The  autocorrelation  method  requires  a larger  analysis 
frame,  typically  2-3  pitch  periods  for  voiced  sounds  and 
20-30  msec.  for  unvoiced  sounds.  This  results  in  a 
"smearing"  of  the  spectral  detail  for  transient  sounds  such 
as  stops  /g/,  /p/,  /k/,  etc.  The  covariance  method  is 
suited  for  pitch  synchronous  analysis,  since  no  windowing  is 
required.  The  analysis  frame  can  be  located  over  the 
closed-glottis  portion,  if  glottal  closure  exists,  leading 
to  a better  estimate  of  the  formant  frequencies  and 
bandwidths  (Gish,  1981).  We  will  explore  this  in  greater 
detail and present experimental results in Chapter 7. The
covariance method does not guarantee stability, although for
voiced sounds it is almost always stable. An unstable
filter is corrected by reflecting its z-transform poles
inside the unit circle, as suggested by Atal and Hanauer
(1971), thus preserving the magnitude of the frequency
response of the original filter (to within a gain factor)
and ensuring stability.
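The pole reflection step can be illustrated with a short Python sketch. It assumes the poles of the synthesis filter have already been obtained by factoring the predictor polynomial (not shown here). The function names `reflect_unstable_poles` and `all_pole_magnitude` are illustrative only and are not taken from Atal and Hanauer (1971).

```python
import cmath


def reflect_unstable_poles(poles):
    """Move every pole with |p| > 1 to 1/conj(p), i.e. same angle, radius 1/|p|."""
    return [1.0 / p.conjugate() if abs(p) > 1.0 else p for p in poles]


def all_pole_magnitude(poles, omega):
    """Magnitude of 1 / prod_k (1 - p_k z^-1) evaluated at z = exp(j*omega)."""
    z_inv = cmath.exp(-1j * omega)
    denom = 1.0
    for p in poles:
        denom *= abs(1.0 - p * z_inv)
    return 1.0 / denom
```

For each reflected pole, the magnitude response is scaled by the constant factor |p|, independent of frequency. For a conjugate pair of radius 1.25 the overall factor is 1.25 squared, which can be absorbed into the gain term, so the spectral shape is unchanged.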


3.3.2 Analysis Frame

Several studies of analysis frame size and location are
available (Chandra and Lin, 1974; Rabiner et al., 1976).
Their conclusions were that the pitch-synchronous covariance
method gave significantly lower normalized squared
prediction error and more accurate formant frequencies and
bandwidths than the autocorrelation method. But the effects
of these parameters on synthesis were not studied in that
work; we examine them in detail in Chapter 7.


3.3.3  Predictor  Order 

The order of the predictor, which corresponds to the
number of coefficients, is decided by the sampling [...]
complex pole in [...] of interest. Each pole requires two
coefficients for its representation. In addition, two to
four coefficients are [...] for excitation and radiation
effect, depending on the type of analysis used. A prediction
order greater than 12 to 14 does not produce an appreciable
change in the naturalness of the synthesizer. We use an
order of 14 in this study.

3.4 Fundamental Frequency and Voiced/Unvoiced/Silence
Estimation

In  a speech  analysis-synthesis  scheme,  two  of  the  basic 
problems  are  i)  determining  whether  a speech  segment  is 
voiced,  unvoiced  or  silence,  and  ii)  estimating  the 

fundamental  frequency  of  the  voiced  segment.  An  abundance 
of  pitch  detection  algorithms  based  on  speech  analysis  in 
both  time  and  frequency  domains  exists.  Hess  (1982)  gives  a 
survey  of  a variety  of  pitch  detection  schemes.  Rabiner 
et  al.  (1976)  describe  an  evaluation  study  of  different 
pitch  detectors.  Their  conclusions  were: 

a) Lack of perfect periodicity in glottal excitation
made accurate pitch determination difficult.

b) Interaction between vocal source and vocal tract
made pitch detection inaccurate during fast formant
transitions.

c) Defining the exact beginning and end of a pitch
period is difficult with zero crossing or peak
picking schemes.

d) Voiced/unvoiced decision is greatly hampered by low
levels of voicing.

e) The various pitch detectors were talker dependent.

f) Nonlinear smoothing of the pitch contour was
required.

In  this  study  we  have  used  a pitch  detector  based  on 
the  linear  prediction  error  signal,  because  the  same  model 
for  analysis  and  synthesis  can  be  employed  for  pitch 
detection  as  well,  thus  reducing  computation.  The  algorithm 
is  based  on  the  Simplified  Inverse  Filter  Tracking  (SIFT) 
pitch  detector  proposed  by  Markel  (1972). 

The  pitch  detector  first  performs  the  voiced/unvoiced/ 
silence  determination  and  if  the  segment  is  voiced,  its 
fundamental  frequency  is  computed.  The  set  of  algorithms  is 
described  below  (see  Figure  3.2). 

3.4.1 Algorithm: Speech/silence determination

1. A 50 msec. silence portion is required in the
beginning of the data record for setting the
silence threshold.

2. For speech samples s(1), ..., s(500), compute the
average absolute magnitude threshold

$$\mathrm{AMTH} = \frac{1}{500} \sum_{n=1}^{500} |s(n)|$$

3. For each subsequent analysis frame of length N
samples, compute the average absolute magnitude

$$\mathrm{AMSL} = \frac{1}{N} \sum_{n=1}^{N} |s(n)|$$

4. If AMSL > 1.5 * AMTH, classify segment as Speech.
If AMSL < 1.5 * AMTH, classify segment as Silence and
set pitch = -1.

Figure 3.2. Speech based pitch detection schemes.
a) Speech/silence detection.
b) Voiced (pitch)/Unvoiced decision.
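The speech/silence steps above can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study: the function names are hypothetical, frames are taken as non-overlapping, and a frame whose average magnitude exactly equals the threshold is treated as silence (a case the algorithm leaves open).

```python
def average_magnitude(x):
    """Average absolute magnitude of a list of samples."""
    return sum(abs(v) for v in x) / len(x)


def classify_frames(s, frame_len, silence_len=500, factor=1.5):
    """Label each frame after the leading silence as 'speech' or 'silence'.

    s           -- list of speech samples, beginning with a silent stretch
    frame_len   -- analysis frame length N in samples
    silence_len -- samples used to set the threshold AMTH (500 in the text)
    factor      -- multiplier applied to AMTH (1.5 in the text)
    """
    amth = average_magnitude(s[:silence_len])
    labels = []
    for start in range(silence_len, len(s) - frame_len + 1, frame_len):
        amsl = average_magnitude(s[start:start + frame_len])
        labels.append("speech" if amsl > factor * amth else "silence")
    return amth, labels
```

A 500-sample threshold segment lasting 50 msec. corresponds to a 10 kHz sampling rate; for other rates `silence_len` would need to be scaled accordingly.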


3.4.2 Algorithm: Voiced (pitch)/Unvoiced determination

1. For each analysis frame of speech s(n),
n = 1, 2, ..., N, remove the mean and lowpass filter to
1 kHz (using a third order elliptical filter (Markel
and Gray, 1976)).

2. Obtain the inverse filter using the autocorrelation
scheme (Section 3.2.1).

3. Pass s(n) through the inverse filter to obtain the
residue signal e(n).

4. Compute the autocorrelation function R(i) for
e(n), n = 1, ..., N, where

$$R(i) = \sum_{n=1}^{N-i} e(n)\, e(n+i)$$

5. Search for a peak in the autocorrelation function
between R(20) and R(N). Let this be R(P).

6. If R(P) > 0.3*R(0): voiced, pitch period = P
samples.
If R(P) < 0.3*R(0): unvoiced, set pitch period = 0.

The pitch contour obtained with this algorithm for the
sentence /we were away a year ago/ for a male subject (JMN)
is shown in Figure 3.3.
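Steps 1 to 6 of the voiced/unvoiced algorithm can be sketched as follows. This is a simplified illustration under stated assumptions, not Markel's SIFT implementation: the 1 kHz elliptic lowpass filter of step 1 is left out to keep the sketch free of external libraries, the inverse filter order of 4 is an assumed value, and all function names are hypothetical. The inverse filter is written as A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.

```python
def autocorr(x, max_lag):
    """R(k) = sum_n x(n) x(n+k) for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion; returns [1, a_1, ..., a_p] of the inverse filter."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a


def sift_like_pitch(frame, order=4, min_lag=20, threshold=0.3):
    """Return the pitch period in samples, or 0 if the frame is judged unvoiced."""
    mean = sum(frame) / len(frame)
    s = [v - mean for v in frame]              # step 1 (mean removal only)
    r = autocorr(s, order)
    if r[0] == 0.0:
        return 0                               # all-zero frame: nothing to analyse
    a = levinson(r, order)                     # step 2
    e = [sum(a[j] * s[n - j] for j in range(order + 1) if n - j >= 0)
         for n in range(len(s))]               # step 3: residue signal
    R = autocorr(e, len(e) - 1)                # step 4
    P = max(range(min_lag, len(e)), key=lambda i: R[i])   # step 5
    return P if R[P] > threshold * R[0] else 0            # step 6
```

Searching from lag 20 upward excludes pitch periods shorter than 20 samples, which at a 10 kHz sampling rate corresponds to fundamental frequencies above 500 Hz.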

This method, like all the other speech-based pitch
detection schemes, has several drawbacks:

1) Fixed frame analysis averages voiced/unvoiced
decisions; V/UV transitions are in error if the
analysis frame is located in such a transition
region.

2) Mixed excitation, characterized by a periodic
segment with a strong noise component due to
frication, is almost always classified as unvoiced.

3) The error signal does not have well defined peaks
for all voiced sounds. In particular, sounds with
a strong first formant, viz., /oo/ and /w/, do not
yield an error signal that possesses strong
"spikes" at the start of every pitch period and
hence the autocorrelation function of the error
signal lacks distinct peaks at the pitch period
(Figure 3.4).

4) Pitch doubling must be taken care of by
neighborhood decisions.

5) Pitch detection is particularly error prone for
breathy, whispery and nasal sounds (Kahn and Garst,
1983).


Figure 3.3. Pitch contour for the sentence "we were away
a year ago" using the SIFT algorithm.
Subject: JMN (male). Vertical axis: number of
samples per period.

Figure 3.4. Incorrect pitch detection due to lack of
pitch peak in the autocorrelation function
of the LP residue signal.

3.5 Improvements to the LPC Analysis-Synthesis Scheme

Summarizing the details of the previous sections,
improvements to the LPC scheme are needed in the following
areas:

i)  voiced/unvoiced  decision, 

ii)  fundamental  frequency  estimation, 

iii)  pitch  synchronous  analysis  for  accurate  spectral 
representation. 

In this study, we have used the Electroglottograph
(EGG) signal, which is derived from a tissue impedance
measurement made across the vocal folds, to provide an
independent measure of the fundamental frequency and the
voiced/unvoiced decision. Pitch synchronous LP analysis is
also performed using the EGG signal. The use of the
Electroglottograph signal to estimate the fundamental
frequency was first investigated by Fourcin and Abberton
(1971). A. Smith (1980) corroborated this result with a
periodicity analysis. Improved spectral tracking with the
EGG signal providing the pitch synchronous analysis frames
was first studied by Gish (1981) and further supported by
Krishnamurthy (1983). But to date, no study has been
reported that investigates the contributions of such
improvements to the "naturalness" of synthetic speech using
the linear prediction synthesizer. We have made such an
investigation by employing an LP analysis/synthesis scheme
that is guided by a synchronously obtained EGG signal. In
the next chapter, we will discuss the nature of the EGG
signal, its characteristics and its use as a source of
independent information about the excitation parameters.
Algorithms to extract these features from the EGG signal,
which will be used in the synthesis scheme, are also
presented.


CHAPTER  4 

USE  OF  THE  ELECTROGLOTTOGRAPH  SIGNAL  TO  IMPROVE 
THE  LPC  ANALYSIS-SYNTHESIS  SCHEME 

4.1 Introduction

As  explained  in  the  previous  chapter,  a reliable  and 
speech-independent  source  of  information  about  the  vocal 
source  would  eliminate  many  of  the  problems  associated  with 
"naturalness"  in  LP  synthesized  speech.  Toward  this  end,  we 
have  used  the  electroglottograph  signal  for  "glottal 
sensing".  In  this  chapter  we  will  discuss  the  schemes  for 
estimating  the  parameters  of  the  vocal  source  and  the  tract, 
with  the  EGG  signal  guiding  the  analysis. 

4.1.1  Description  of  Electroglottography 

The Electroglottograph measures the change in tissue
impedance across the vocal folds. A high frequency current
source (3 MHz to 5 MHz) is modulated by the variation in
tissue impedance of the vocal folds when they contact and
separate. Demodulation yields a signal that varies in
proportion to the impedance. Electroglottography has been
reviewed by Lecluse (1977) and Fourcin and Abberton
(1971). Recent studies by Childers, Smith and Moore (1983)
and Childers et al. (1983a) have suggested a correspondence
between the EGG signal and specific events of the laryngeal
vibratory cycle. Figure 4.1.a shows a plot of the EGG signal
for a vowel /a/ and Figure 4.1.b is a proposed model of the
EGG waveform (Rothenberg, 1981; Childers, 1983).

4.1.2  Features  of  the  EGG  Signal 

The EGG signal is believed to represent the variation
in the vocal fold lateral contact area. The specific
features of the laryngeal vibratory cycle that can be
derived from the EGG signal are i) period of vibration, ii)
region of glottal closing, iii) region of glottal opening,
and iv) presence or absence of laryngeal vibration
(voicing/unvoicing).

The advantages of the EGG signal in providing
information about the source-related parameters are

a) it is non-invasive and can be obtained
simultaneously with speech,

b) it is unaffected by supraglottal effects. Formants
of the vocal tract and source-tract coupling have
no undesirable influence,

c) it provides fundamental frequency and
voicing/unvoicing decisions for males, females and
children with equal ease and reliability, whereas
speech based algorithms do not work uniformly for
all pitch ranges,

d) it is a dynamic indicator of vocal source
characteristics and no averaging is necessary to
determine the fundamental frequency.

Figure 4.1. The Electroglottograph (EGG) signal.
b) A proposed model for the EGG waveshape
(Rothenberg, 1981; Childers et al., 1983a).

We  have  used  the  scheme  shown  in  Figure  4.2  to  improve 
the  LP  analysis/synthesis  scheme,  guided  by  the  EGG  signal. 

The  speech  and  the  EGG  signal  are  obtained 
simultaneously  and  digitized  using  a two  channel  A/D 
converter.  The  EGG  signal  is  differentiated  and  the  glottal 
excitation  parameters,  viz.,  voiced/unvoiced  decision  and 
fundamental  frequency  contour  are  computed  as  outlined  in 
algorithm  4.2.2. 

The F0 contour is used to perform a pitch synchronous
analysis  of  the  speech  data.  Two  types  of  analysis  frames 
are  considered,  a)  pitch-period  long  frames  and  b)  frames 
positioned  over  the  closed  glottis  region.  The  algorithms 
4.3.1  and  4.3.2  are  used  for  the  analysis  cases  a)  and  b) 
respectively.  The  covariance  method  of  linear  prediction  as 
described  in  Chapter  3 is  employed  in  each  case,  resulting 
in  a set  of  LP  coefficients  and  a gain  term  for  each 
analysis  frame. 

This completes the analysis part, where the model
parameters have been obtained very reliably. These
parameters are used to synthesize the original sentence with
the intent of achieving the highest degree of "naturalness"
in the synthetic sentence. Several types of excitation
functions for the synthesis model are also studied, as
discussed in Chapter 5. The actual synthesis scheme is
detailed in Chapter 7.

Figure 4.2. EGG guided linear prediction analysis-synthesis scheme.

We  now  describe  the  algorithms  for  obtaining  the 
voiced/unvoiced  and  fundamental  frequency  parameters  along 
with  pitch  synchronous  LP  analysis. 


4.2 Determination of Excitation Parameters Using EGG
4.2.1 Voiced/Unvoiced-Silence and Fundamental Frequency

The  EGG  signal  indicates  the  presence  or  absence  of 
glottal  vibration  and  provides  a good  indication  of  voicing 
or  unvoicing.  This  also  means  that  a distinction  between 
unvoicing  and  silence  cannot  be  made  from  the  EGG  signal 
alone  since  each  of  these  conditions  corresponds  to  the 
absence  of  glottal  vibration. 

In  the  linear  prediction  synthesis  scheme,  an  accurate 
decision  regarding  voicing  is  the  most  demanding 
requirement.  The  distinction  between  unvoicing  and  silence 
can  be  made  reliably  from  the  speech  signal.  Besides,  the 
synthesis  gain  for  the  silence  region  will  be  very  low 
compared  to  that  for  the  unvoiced  frame  and  this  will 
further  improve  the  distinction  between  them. 

The EGG signal exhibits two slope discontinuities as in
Figure 4.3. The sharp fall labeled (b) corresponds to the
first contact of the folds leading to closure, and the slope
discontinuity labeled (a), when present, corresponds to the
region of glottal opening. These conclusions are based on
our simultaneous measurements of glottal area, EGG and
speech as reported in Childers et al. (1982, 1983a).

Figure 4.3. EGG and differentiated EGG waveforms,
illustrating a) glottal opening and
b) point of first contact of the vocal
folds.

Thus, differentiating the EGG signal will enhance these
slope discontinuities as in Figure 4.3. The algorithm for
fundamental frequency and V/UV determination is based on the
differentiated EGG, as given below:

4.2.2 Algorithm: Pitch estimation using EGG

i) A 50 msec. segment of the EGG signal corresponding
to silence is required for setting the threshold.

ii) For the above segment, differentiate the signal
with a filter of the form $H(z) = 1 - z^{-1}$ and
compute the threshold

$$\mathrm{AMTH} = \frac{1}{500} \sum_{i=1}^{500} |E(i)|$$

iii) For the subsequent data, set the search window
length to NW samples. Differentiate E(i), i = 1, ..., NW
with $H(z) = 1 - z^{-1}$.

iv) Compute

$$\mathrm{AMSL} = \frac{1}{NW} \sum_{i=1}^{NW} |E(i)|$$

If AMSL > 1.5 * AMTH, the segment is voiced.
If AMSL < 1.5 * AMTH, the segment is unvoiced or
silence.

v) If unvoiced or silence, increment by NW and go to
iii), otherwise proceed.

vi) For E(i), i = 1, ..., NW: pick the first negative
peak as reference. IPT1 = i. Store in array: IP(i).

vii) For E(i), i = (IPT1 + 10), ..., (IPT1 + NW),
differentiate E(i), pick the next negative peak.
IPT2 = i. Store in array: IP(i+1).

viii) Pitch period IPT = IPT2 - IPT1.

$$\text{Fundamental frequency} = \frac{F_s}{\mathrm{IPT}}, \quad \text{where } F_s = \text{sampling frequency}.$$

ix) The positions of the negative peaks IP(i), i = 1, ...,
correspond to points of excitation.

x) Set next frame from IPT2+1 to IPT2+NW, go to iii).

The fundamental frequency and the voicing decision can
be reliably obtained if the EGG signal is strong and noise
free. This is true for all the normal subjects that we have
studied. But the peak corresponding to glottal opening is
not present in all the cases. Whenever the closed phase
duration is required, a good approximation is one half the
duration between consecutive negative peaks corresponding to
one period.

A very desirable outcome of this algorithm is its
ability to detect regions where glottal activity is present
but a constriction along the vocal tract produces a
frication source, thus generating a "mixed excitation" sound
(Krishnamurthy, 1983). The speech based algorithm (SIFT)
used in this study classified it as unvoiced due to the
noisy error signal and lack of a strong pitch peak in its
autocorrelation function. But the EGG provides a reliable
pitch value although the "degree" of voicing is not
available. The improvements in synthesis for such a
situation are discussed in Chapter 7.
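The EGG-based pitch estimation of Algorithm 4.2.2 can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not in the text: the threshold sums use absolute values, the most negative sample inside each search window stands in for the "first negative peak", each new search window begins `min_gap` samples after the previous peak, and the function and parameter names are hypothetical.

```python
def differentiate(x):
    """Apply H(z) = 1 - z^-1 with a zero initial state."""
    return [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]


def egg_pitch(egg, fs, nw, silence_len=500, factor=1.5, min_gap=10):
    """Return (excitation points, F0 values in Hz) from an EGG record.

    egg         -- EGG samples, starting with silence_len samples of silence
    fs          -- sampling frequency in Hz
    nw          -- search window length NW in samples (a little over one period)
    min_gap     -- samples skipped after a peak before the next search (10 in the text)
    """
    d = differentiate(egg)
    amth = sum(abs(v) for v in d[:silence_len]) / silence_len
    marks, f0 = [], []
    last = None                      # previous peak within the current voiced run
    start = silence_len
    while start + nw <= len(d):
        amsl = sum(abs(v) for v in d[start:start + nw]) / nw
        if amsl <= factor * amth:    # unvoiced or silence: skip the window
            last = None
            start += nw
            continue
        ipt = min(range(start, start + nw), key=lambda i: d[i])
        if last is not None:
            f0.append(fs / (ipt - last))   # F0 = Fs / IPT
        marks.append(ipt)
        last = ipt
        start = ipt + min_gap
    return marks, f0
```

When the closed phase duration is needed and no opening peak is visible, half the difference between consecutive entries of `marks` gives the approximation described above.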


4.2.2.1 Comments on the Algorithm

One source of error is the fluctuation in the mean
level of the EGG signal due to adjustments in larynx height
by the subject. Another error can be caused by EGG
electrode displacement, which results in a "trend" in the EGG
signal. A simple zero-crossing measurement for pitch
determination will obviously be in error in the presence of
a trend. But differentiation and peak picking, as done
here, provide more accurate results. An interactive pitch
computation program implemented on the HP-2648A graphics
terminal was employed to provide an initial pitch estimate
to set the search window size. The factor 1.5 in the
voicing/unvoicing threshold was determined heuristically.
Figure 4.4 is a plot of the pitch contour obtained for the
sentence /we were away a year ago/ for a normal male subject
(JMN).

4.3 Spectral Estimation Using EGG

One of the factors that degrade the naturalness of
synthetic speech is the spectral estimation of a speech
signal over fixed frames. It was seen in the course of this
study that even for steady state vowels, the LP spectrum
obtained with the autocorrelation method varied from frame
to frame. Wong (1980) reported similar results, suggesting
this to be the cause of a "warble" effect in the synthetic
signal. Further, the gain values obtained from the analysis
also show large fluctuations from frame to frame, which
further deteriorate the synthesis. Chandra and Lin (1974),
in a comparative study of autocorrelation and covariance
methods for both natural and synthetic speech, found that the
pitch synchronous covariance method gave better spectral
estimates, as measured from formant frequencies and their
respective bandwidths, than the autocorrelation method
performed asynchronously. Both methods produced poor
spectral estimates when used pitch asynchronously, as did
the autocorrelation method used pitch synchronously. They
recommended the pitch synchronous covariance method performed
over the closed glottis region for improving synthesis
naturalness, but obtaining accurate values of pitch and
closed glottis regions was identified as a problem area.
The earlier results of Pinson (1963), using a least squares
fit approach, gave better estimates of formant frequencies
and bandwidths when the analysis was pitch synchronous.

Figure 4.4. Pitch contour for the sentence, "we were
away a year ago", obtained from the EGG
signal.

It is worthwhile to point out here that the formant
synthesizer represents each formant by a resonator and hence
an accurate specification of formant frequencies and
bandwidths is essential for producing natural sounding
speech. But the linear prediction scheme simply fits an
all-pole model spectrum to the given signal spectrum by a
single recursive filter, according to the minimization
criterion detailed in Section 3.2.1. Hence an explicit
formant specification is irrelevant to the model. But the
formants corresponding to the natural frequencies of the
vocal tract can be derived from the LP model spectrum by
assigning the spectral peaks to the formants (McCandless,
1974). We have seen earlier that the estimation of LP model
spectra is influenced by the analysis conditions, such as
pitch synchronicity, choice of autocorrelation or covariance
method, etc. We select the analysis conditions that
produce the best spectral tracking, which would thus be
physiologically consistent with the smooth movements of the
articulatory mechanism. We will further show, by synthesis
experiments, that this selection is also perceptually
consistent, by virtue of the improvements in the
"naturalness" of the synthetic speech generated with such a
selection of the analysis parameters.

As  discussed  earlier,  pitch  synchronous  analysis  with 
the  covariance  method  performed  over  either  one  period-long 
frames  or  closed  glottis  regions  would  produce  the  best 
formant  frequency  and  bandwidth  values.  The  problem  of 
frame  selection  can  be  resolved  by  using  the  EGG  signal 
obtained  simultaneously  with  the  speech  signal.  The 
algorithm  for  pitch  determination  guided  by  the  EGG  signal 
was  presented  in  Section  4.2.  Once  the  pitch  periods  are 
determined,  the  LP  analysis  using  the  covariance  method  is 
performed  as  per  the  following  scheme. 


4.3.1 Algorithm: Analysis frame one pitch period long

1) Speech data are available over one pitch period of
length NP samples: s(i), i = 1, ..., NP. In addition,
initial conditions s(i-k), k = 1, ..., p, from the
previous period of the speech data are also
available.

2) Derive the LP coefficients a_k, k = 1, ..., p, for
s(i), i = 1, ..., NP, using the covariance method. The
model gain parameter is G (Section 3.2.1.2), p =
predictor order.

3) Compute the model spectrum $H(z) = G / A(z)$. A(z) is
obtained in the frequency domain by computing the
DFT of the sequence 1, a_1, a_2, ..., a_p. G is obtained
from the squared prediction error, matched to the
signal energy.
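The pitch-period-long covariance analysis can be sketched in Python as below. This is an illustrative sketch, not the program used in the study. It assumes the inverse filter convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p suggested by the sequence 1, a_1, ..., a_p in step 3, solves the normal equations by plain Gaussian elimination rather than a fast algorithm, and leaves the choice of gain to the caller. All function names are hypothetical.

```python
import cmath
import math


def covariance_lpc(s, start, length, order):
    """Covariance-method LP analysis of s[start:start+length].

    Uses s[start-order:start] as initial conditions, so start >= order.
    Returns (a, err): a = [a_1..a_p] of A(z) and the minimum squared error.
    """
    def phi(i, k):
        return sum(s[n - i] * s[n - k] for n in range(start, start + length))

    # Normal equations: sum_k a_k phi(i,k) = -phi(i,0), i = 1..p (augmented matrix)
    m = [[phi(i, k) for k in range(1, order + 1)] + [-phi(i, 0)]
         for i in range(1, order + 1)]
    for col in range(order):                      # elimination with partial pivoting
        piv = max(range(col, order), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, order):
            f = m[r][col] / m[col][col]
            for c in range(col, order + 1):
                m[r][c] -= f * m[col][c]
    a = [0.0] * order
    for i in range(order - 1, -1, -1):            # back substitution
        acc = m[i][order] - sum(m[i][j] * a[j] for j in range(i + 1, order))
        a[i] = acc / m[i][i]
    err = phi(0, 0) + sum(a[k - 1] * phi(0, k) for k in range(1, order + 1))
    return a, err


def model_spectrum(a, gain, nfft=256):
    """|H| = gain / |A| at nfft//2 + 1 frequencies from 0 to half the sampling rate."""
    seq = [1.0] + list(a)
    spec = []
    for b in range(nfft // 2 + 1):
        w = 2.0 * math.pi * b / nfft
        A = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(seq))
        spec.append(gain / abs(A))
    return spec
```

Because no window is applied and the samples before the frame serve as initial conditions, a signal that exactly obeys an all-pole recursion within the frame is recovered with essentially zero prediction error, which is the property that makes this method attractive for short, pitch synchronous frames.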

4.3.2  Algorithm:  Analysis  frame  positioned  over  the 

closed-glottis  portion 

1)  The  region  of  glottal  closure  is  obtained  from  the 
algorithm  4.2.2,  as  the  segment  between  the 
negative  peak  and  the  following  positive  peak  of 
the  differentiated  EGG  waveform.  To  ensure  correct 
picking  of  the  positive  peak,  the  closed  glottis 
segment  length  is  set  by  default  to  2p+l  samples, 
starting  from  the  negative  peak,  for  cases  where 
positive  peak  may  not  exist  or  may  be  too  noisy. 

2) For s(i), i = 1, ..., NC, where NC = length of glottal
closure, obtain the LP coefficients a_i, i = 1, ..., p, using
the covariance method, without any time windows.

3)  Obtain  the  spectrum  as  in  the  previous  algorithm. 
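Step 1 of this algorithm, the choice of the closed-glottis frame, might be sketched as follows. The test for a missing or unusable positive peak is not specified in the text, so the `min_peak` threshold below is a hypothetical stand-in, as is the function name.

```python
def closed_phase_frame(degg, neg_peak, next_neg_peak, order, min_peak=0.0):
    """Return (start, length) of the closed-glottis analysis frame.

    degg          -- differentiated EGG samples
    neg_peak      -- index of the negative peak that starts the period
    next_neg_peak -- index of the following negative peak
    order         -- predictor order p
    min_peak      -- assumed smallest acceptable positive peak value
    """
    default_len = 2 * order + 1
    candidates = range(neg_peak + 1, next_neg_peak)
    if len(candidates) > 0:
        pos_peak = max(candidates, key=lambda i: degg[i])
        if degg[pos_peak] > min_peak:
            return neg_peak, pos_peak - neg_peak
    # No usable positive peak: fall back to the default length of 2p+1 samples
    return neg_peak, default_len
```

With a predictor order of 14, the fallback frame is 29 samples long.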

4.4 Comparison of Speech-Derived
and EGG-Derived Parameters

4.4.1 V/UV and F0 Estimates

The improvement in voiced/unvoiced decision with the
use of EGG is shown in Figure 4.5, compared with the speech
derived F0 contour. The EGG guided analysis performs well
for low pitch sounds as well as for lip rounded sounds such
as /w/ in the sentence used here. Speech based pitch
analysis performs poorly for this case, as discussed in
Section 3.4. Low pitch speech requires longer frames to
obtain better autocorrelation estimates displaying a strong
pitch peak, but voiced/unvoiced boundaries are determined
incorrectly in that case. The EGG, on the other hand,
performs consistently for low as well as high pitched sounds
and clearly yields very reliable F0 and V/UV decisions.

Figure 4.5. Comparison of pitch contours obtained from
speech (dotted line) and EGG (solid line) for
the sentence, "we were away a year ago".
Subject: JMN (male).

Figure 4.6. Comparison of formant frequency contours
obtained from three different analysis
conditions.
a) Fixed frame autocorrelation LP analysis.
b) Pitch synchronous covariance LP analysis over one
whole pitch period.
c) Pitch synchronous covariance LP analysis over glottal
closed phase only.


4.4.2  Spectral  Tracking 

The spectral tracking contours for the same sentence as
above are shown in Figure 4.6 for three methods. The
covariance method performed over the closed phase, as
determined from the EGG signal, gives the smoothest formant
contour. The fixed frame autocorrelation method yields a
noisy formant contour, whereas the covariance method with an
analysis frame equal to one pitch period falls between
the other two. The improvements in naturalness of
synthesis with the above analysis frame sizes will be
described in Chapter 7.


CHAPTER  5 
EXCITATION  MODELS 

5.1 Introduction

We  saw  in  Chapter  4 that  the  EGG  signal  provides  a 
reliable  voiced/unvoiced  decision,  an  accurate  fundamental 
frequency  estimate  and  an  improved  method  for  spectral 
tracking.  One  of  the  remaining  problems  in  linear 
prediction  synthesis  is  the  type  of  excitation  used.  While 
the  accurate  detection  of  voicing  ensures  intelligibility, 
the  naturalness  of  synthesis  is  greatly  affected  by  the 
particular  type  of  excitation  used  to  represent  voicing 
(Holmes,  1973;  Rosenberg,  1971). 

The  linear  prediction  analysis  divides  the  speech 
signal  into  two  components,  a)  system  component  represented 
by  the  coefficients  and  b)  an  excitation  component 
represented  by  the  residue  signal.  If  we  use  the  residue 
signal  as  the  excitation  to  the  synthesizer,  then  we  recover 
the  original  speech  signal.  But  retaining  the  entire  error 
signal  requires  large  data  storage  and  high  bit  rates  for 
transmission.  This  defeats  the  purpose  of  the  LP  technique, 
viz., data compression. Over the years, many schemes have
been proposed to generate an excitation signal that
resembles the residue signal, and at the same time provides
savings in storage and data rate. These fall into three
categories as described below.

a) Residue excited vocoder. The excitation signal in this
type of vocoder is derived from the actual residue
signal. Only a portion of the spectrum of the residue
signal is transmitted. At the receiver, the full
spectrum is reconstructed and the excitation signal is
regenerated as in Figure 5.1.a (Arjmand, 1983; Hedelin,
1983; Dankberg and Wong, 1979). The selection and
reconstruction of the spectrum are subject to errors,
resulting in a degradation of naturalness. This method
has been implemented for medium bit rate [9600 bits/sec]
transmission.

b) Voice excited vocoder. In this scheme, a baseband of
the actual speech spectrum is transmitted. At the
receiver, the excitation signal is obtained by a
spectral flattening operation on the baseband
(Weinstein, 1975; Atal et al., 1975) as in Figure
5.1.b. This method, like the residue excited vocoder,
does not require an explicit voicing/unvoicing
determination, but the synthesis is less natural than
the residue excited vocoder.


Figure 5.1. Excitation models for LP synthesis.
a) Residue excited vocoder.
b) Voice excited vocoder.
c) Parametric excitation.

c) Parametric excitation. In this case, both voicing and
unvoicing are represented by parametric models. The
parameters of the model, such as fundamental frequency,
gain and unvoicing duration, are specified individually
as in Figure 5.1.c. The ability to control the
excitation parameters separately makes this scheme very
useful for systematically studying speech
intelligibility and naturalness. In this study, we have
examined the improvements in naturalness with different
parametric models for excitation. We have also proposed
an excitation model derived from the Electroglottograph
signal.

The  importance  of  an  excitation  model  resembling  the 
volume  velocity  waveform  has  been  well  demonstrated  for 
formant  synthesis  (Holmes,  1979;  Rosenberg,  1971;  Yea, 
1983).  But  the  excitation  signal  for  the  LP  synthesizer 
must  resemble  the  residue  signal  and  hence  must  satisfy 
different  requirements. 

5.2 Requirements of an Excitation Signal

5.2.1 Spectral Flatness

We saw in Chapter 3 that the linear prediction
technique approximates the signal spectrum by an all-pole
spectrum. The resulting error spectrum is approximated by a
flat spectrum,

$$ |E(\omega)|^2 = G^2 \tag{5.1} $$

Hence, any parametric excitation model should possess a flat
spectrum. Sambur et al. (1978) suggested a waveform derived
from Rosenberg's model (1971), which had a flat spectrum up
to 1 kHz only. This resulted in a "bassy" speech which also
sounded unnatural.

5.2.2  Low  Peak  Factor 

One  persistent  problem  with  LP  synthesis  using 
parametric  excitation  models  has  been  the  presence  of 
"buzziness",  which  makes  the  synthetic  speech  sound 
unnatural.  Several  studies  have  addressed  this  issue  and 
suggested  different  excitation  models  to  remedy  the  problem 
(Sambur et al., 1978; Wong and Markel, 1980; Atal and David,
1979).

A principal  cause  of  buzziness  is  believed  to  be  the 
short  durational  pulses  used  to  model  voiced  excitation.  In 
such  a case,  after  each  excitation  pulse  goes  to  zero,  the 
synthesized  speech  decays  rapidly  and  remains  at  a small 
value  till  the  next  excitation  pulse  arrives.  The  result  is 
a waveform  that  has  sharp  peaks  at  the  points  of  excitation 
and relatively small values in between. Figure 5.2.a
compares natural and synthesized speech waveforms for the
vowel /a/, for low F0 speech. An impulse excitation was
used in the synthesis. The synthetic vowel was perceived as
"buzzy". On the other hand, lack of buzziness in high F0
speech has been reported in the literature (Sambur et al.,
1978). For high F0 speech, the pitch periods are shorter
and hence the synthesized speech decays for a shorter
duration before the next excitation pulse occurs.
Consequently, the waveform has an appreciable amplitude for
a longer portion of each period than in the case of low F0
speech. Therefore, a broader pulse shape is required for
reducing buzziness. But a very smooth waveform with little
high frequency content results in "bassy" speech, which
also sounds unnatural. Hence an acceptable model should
have a broad pulse shape and at the same time possess a
reasonably flat spectrum. While there are no quantitative
procedures for establishing these characteristics, the
analysis-by-synthesis approach used in this study seems to
be the only way to determine an optimum shape for the
excitation signal.

Figure 5.2. Synthesis with different excitation waveforms.
b) Synthesized speech with Fant's excitation and LP
residue excitation.
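
The effect of the pitch period on the peakiness of impulse excited synthesis can be illustrated numerically. The sketch below rests on assumptions that are not in the text: "peak factor" is taken to mean the ratio of peak magnitude to RMS value, the synthesis filter is replaced by a single decaying exponential (`decay=0.9` is a hypothetical value), and periods of 100 and 40 samples stand in for low and high F0 at a 10 kHz sampling rate. It is an illustration only, not a procedure for choosing the pulse shape.

```python
import math


def peak_factor(x):
    """Ratio of peak magnitude to RMS value (assumed definition)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms


def impulse_excited_decay(period, n_periods=10, decay=0.9):
    """Response of a one-pole decaying filter to a unit impulse train.

    The one-pole filter is a stand-in for the damped response of the
    synthesis filter; it is not the LP synthesizer itself.
    """
    s, prev = [], 0.0
    for n in range(period * n_periods):
        u = 1.0 if n % period == 0 else 0.0
        prev = decay * prev + u
        s.append(prev)
    return s


low_f0 = impulse_excited_decay(period=100)   # about 100 Hz at 10 kHz
high_f0 = impulse_excited_decay(period=40)   # about 250 Hz at 10 kHz
print(peak_factor(low_f0), peak_factor(high_f0))
```

With these settings the longer period gives the larger peak factor, in line with the observation that low F0 synthesis spends more of each period at small values.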

5.2.3  Correspondence  with  Glottal  Events 

The excitation model should incorporate specific
glottal events such as opening, closing and closure, in
order to provide a physiologically relevant signal. One can
study the improvements in naturalness when the dynamics of
the glottal source are included in the excitation signal.
Parameters such as jitter (perturbation of F0), rate of
closure, duration of closure, etc. can be individually
specified.

Figure 5.3. Rosenberg's model for glottal excitation.
a) Basic pulse shape for Rosenberg's model.
b) Waveshape parameters (Rosenberg, 1971)
[with permission from Acoustical Soc. of America].

5.3 Parametric Models for Voiced Excitation

5.3.1  Rosenberg's  Model 

Rosenberg  (1971)  derived  the  volume  velocity  waveform 
using  inverse  filtering.  The  basic  shape  of  this  waveform 
was  then  simulated  with  various  artificially  generated  pulse 
shapes.  The  number  and  locations  of  slope  discontinuities 
in  these  pulse  shapes  were  varied.  The  various  waveforms 
are shown in Figure 5.3.b. Sambur et al. (1978) used the
above pulse shapes to drive an LPC synthesizer. To obtain a
flat spectrum, the duty cycle, i.e., the ratio of open
duration to pitch period ((Tp + Tn)/T in Figure 5.3.a), was
fixed at 12%. A higher duty cycle resulted in a "bassy"
synthesis. The various pulse shapes produced the same
improvements in naturalness as long as the parameters Tp/T
and Tn/T were fixed. They reported a distinct improvement
in naturalness over synthesis with an impulse source.

5.3.2  Fant's  Model 

We have used a glottal excitation model proposed by
Fant (1979), as illustrated in Figure 5.4. The slope of the
closing phase is controlled by the parameter K. The high
frequency part of the spectrum of this waveshape can be
varied by changing the slope at closure.

Figure 5.4. Fant's model for glottal excitation.

$$
\begin{aligned}
\text{I } (0 - T_1):\quad & U = 0.5\,U_0\,(1 - \cos\omega_g t) \\
\text{II } (T_1 - T_2):\quad & U = U_0\,\bigl(K\cos\omega_g (t - T_1) - K + 1\bigr) \\
\text{III } (T_2 - T):\quad & U = 0
\end{aligned}
$$

The  actual  excitation  waveform  used  to  drive  the  LP 
synthesizer  was  obtained  by  differentiating  Fant's 
excitation  waveform  to  enhance  the  spectral  flatness,  as  in 
Figure  5.5. 
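
A minimal sketch of generating one period of this waveform and differentiating it is given here. Several choices are assumptions rather than details taken from the text: the angular frequency is set to `pi / t1` so that the opening branch peaks at `t1`, the closing branch is measured from `t1` so that the two branches meet at `u0`, `t2` is taken as the point where the closing branch reaches zero, time is counted in samples, and a first difference stands in for differentiation. The parameter values in the example are hypothetical.

```python
import math


def fant_pulse(period, t1, k, u0=1.0):
    """One period of the three-segment glottal flow model, in samples.

    Assumptions (not stated in the text): w_g = pi / t1, the closing
    branch is measured from t1, and t2 is where that branch reaches zero.
    """
    if k < 0.5:
        raise ValueError("k must be at least 0.5 for the flow to reach zero")
    wg = math.pi / t1
    t2 = t1 + math.acos((k - 1.0) / k) / wg
    if t2 > period:
        raise ValueError("open phase is longer than the pitch period")
    pulse = []
    for n in range(period):
        if n <= t1:                       # I: opening phase
            u = 0.5 * u0 * (1.0 - math.cos(wg * n))
        elif n < t2:                      # II: closing phase
            u = u0 * (k * math.cos(wg * (n - t1)) - k + 1.0)
        else:                             # III: closed phase
            u = 0.0
        pulse.append(u)
    return pulse


def first_difference(x):
    """Discrete approximation of differentiation."""
    return [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]


flow = fant_pulse(period=80, t1=30, k=1.0)    # hypothetical values
excitation = first_difference(flow)
```

In this sketch a larger `k` shortens the closing phase and steepens the slope at closure, which is the property the model uses to control high frequency content.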

The  effect  of  varying  the  model  parameters  on 
naturalness  will  be  discussed  in  Chapter  7. 

5.3.3  Differentiated  EGG  Waveform 

In one of the experiments in this study, the speech and
the EGG signals were obtained synchronously. The EGG signal
was differentiated as in Figure 5.6. This signal was used
as an excitation signal for the LP synthesizer. The
synthesis filter had been derived from the corresponding
speech signal. The synthesized speech, viz., the sentence
/we were away a year ago/, sounded very natural. The
spectrum of the differentiated EGG signal is reasonably flat
up to 1500 Hz (Figure 5.7.a). The vowel /a/ synthesized
using the differentiated EGG waveform is compared with the
original speech in Figure 5.8. It can be seen that the
damping after excitation is much less than it is for impulse
excited synthesis (Figure 5.2.a). As a result, the
synthesis is less buzzy. The differentiated EGG signal was
shown to provide a reliable estimate of F0 and
voicing/unvoicing discrimination. The same signal will also
serve as an effective excitation signal. The results of
synthesis experiments using the differentiated EGG signal as
the excitation signal will be discussed in Chapter 7.

Figure 5.5. Fant's excitation differentiated to increase
spectral flatness.

Figure 5.6. EGG and differentiated EGG waveforms.

Figure 5.7. Comparison of the spectra of two excitation
waveforms (spectral magnitude in dB).
a) Spectrum of the differentiated EGG waveform.
b) Spectrum of the LP residue waveform.

Figure 5.8. Comparison of natural speech and synthesized
speech, with differentiated EGG excitation.


5.4 Unvoiced Excitation

Unvoiced sounds such as /f/ and /s/ are produced by
exciting the vocal tract with a noise source. This is in
the form of turbulence created at a constriction along the
vocal tract. In the linear prediction synthesizer the noise
excitation is modeled by a uniformly distributed random
number sequence. The naturalness of synthesis is determined
primarily by the voiced segments and hence the above model
is adequate to synthesize unvoiced sounds.


5.5 Summary

In  this  chapter  we  discussed  the  various  models  used 
for  exciting  the  LP  synthesizer.  The  differentiated  EGG 
waveform  was  shown  to  be  an  effective  excitation  signal. 
The  improvements  in  naturalness  of  synthesis  using  these 
models  will  be  discussed  in  Chapter  7. 


CHAPTER  6 

DATA  COLLECTION  FOR  ANALYSIS-SYNTHESIS 


The  parameters  of  the  LP  analysis-synthesis  scheme  are 
obtained  from  the  speech  and  the  Electroglottograph  (EGG) 
signals.  So,  data  collection  involves  obtaining  these  two 
signals,  time  synchronously,  digitizing  them  for  computer 
processing  and  correcting  for  various  distortions  introduced 
in  the  process.  Each  of  these  stages  will  be  outlined 
below. 


6.1 Speech and EGG Data

The goal of our synthesis scheme is the production of
natural sounding speech. Therefore, we require that the
speech data be intelligible, normal and syntactically
valid. Speech intelligibility is generally tested in a
language independent context, because the listener should
identify the speech sound without the help of linguistic
cues. Naturalness, however, is judged by the listener's
total auditory impression. So, words are suitable for
intelligibility testing and sentences are ideal for testing
naturalness. Synthesized sentences should retain the
natural durational and intonational cues and provide
adequate representation of different phonetic types. In our
study, we have used the following sentences:

1)  We  were  away  a year  ago. 

2)  Should  we  chase  those  cowboys? 

Sentence  1 consists  of  voiced  sounds  only  and  is  useful 
for  determining  the  contribution  of  voiced  excitation  models 
to  naturalness.  Sentence  2 has  a mix  of  voiced  and  unvoiced 
sounds  which  is  useful  for  testing  the  reliability  of  the 
voicing/unvoicing  algorithm  and  the  accuracy  of  the  spectral 
representation  at  the  V/UV  boundaries.  Only  sentence  1 was 
used  for  listener  evaluation  tests. 

The speech signal was obtained using a Bruel and Kjaer
Type 2804 microphone and recorded on Channel 1 of a Revox
A77 stereo tape recorder. The subject group from which the
natural sentences were recorded consisted of one male (JMN),
one female (DDL) and one child (BJT). All except JMN were
native speakers of English. The
sentences were recorded in a quiet room. Towards the end of
the study, one set of sentences was recorded for BJT and DDL
in an Industrial Acoustics Company sound treated room, using
an Electro-Voice RE10 microphone.

The EGG signal was collected synchronously with speech
and recorded on Channel 2 of the Revox A77 stereo tape
recorder. We used two Electroglottographs in this study,
one designed by Dr. Dale Teany and manufactured by
Synchrovoice Associates, and the other, designed by
Dr. A. J. Fourcin, called the Laryngograph.

The  relation  between  the  characteristics  of  the  EGG 
signal  and  the  glottal  vibratory  cycle  was  established  using 
ultra-high  speed  films  of  the  larynx  (Moore,  1975).  The 
larynx  was  filmed  at  approximately  5000  frames/sec,  while 
the  subject  produced  a sustained  vowel  sound  /i/.  At  the 
same  time,  the  speech  and  EGG  signals  were  recorded  on 
tape. The glottal area, length and width were measured from
the film using an image processing system (Krishnamurthy,
1981). Using a special timing circuit, the EGG and speech
data  were  aligned  in  time  with  the  film  measurements.  An 
oscilloscope  trace  of  the  EGG  signal  was  also  impressed  on 
the  film,  while  the  larynx  was  being  filmed.  This  provided 
a validation  of  data  alignment.  Our  data  base  of  film,  EGG 
and  speech  exceeds  one  hundred  tasks  for  both  normal  and 
pathological  subjects.  The  above  procedure  is  discussed  in 
detail  along  with  extensive  results  of  data  alignment  in 
Childers  et  al.  (1982,  1983a). 

The laryngeal films can be obtained only for certain
vowel sounds, which expose the larynx adequately. Any
movement of the tongue or lips would thus make it impossible
to film the larynx. So, the sentences used in our
analysis-synthesis study have no film data. Our basic
assumption is
that the correspondence between the characteristics of the
EGG signal and the laryngeal vibratory cycle established for
sustained vowels also holds for voiced sounds produced
dynamically in a sentence.

6.2 Data Digitization

The speech and EGG data were lowpass filtered to 5 kHz,
sampled at 10 kHz and digitized using a 12-bit, 2-channel
A/D converter.

The  digitized  data  were  stored  on  disk  for  later 
processing.  The  data  collection  set  up  is  shown  in 
Figure  6.1.  Segments  of  aligned  EGG  and  speech  data  for  a 
male,  female  and  child  are  plotted  in  Figure  6.2. 

Figure 6.2. Segments of time aligned speech and EGG data.
a) Unvoiced/voiced transition region from the sentence,
"Should we chase those cowboys". Subject: JMN (male).
b) Unvoiced/voiced transition region from the sentence,
"Should we chase those cowboys". Subject: DDL (female).
c) A segment from the vowel /i/. Subject: BJT (child).

6.3 Errors and Corrections

Several corrections were necessary to compensate for
errors introduced during recording and alignment, as
described below.

i) Correction was made for acoustic propagation delay by
shifting the speech signal appropriately, relative to
the EGG signal. The microphone was held at a distance
of approximately 15 cm in front of the lips. The
length of the vocal tract for the male subject was
assumed to be 17 cm. Taking the velocity of sound
to be 344 meters/sec, the propagation delay between
the larynx (EGG) and the microphone (speech) was
0.93 millisecond. At a sampling frequency of 10 kHz,
this corresponds to a delay of 9 samples. Assuming
shorter vocal tract lengths for female and child
subjects, the vocal tract delay was set at 8 and 7
samples respectively.

ii) Movement of the EGG electrodes during phonation
resulted in "trends" in the EGG signal, which
introduce errors in F0 and V/UV computations. A trend
removal algorithm, based on the algorithm in Childers
and Durling (1975), was used to correct this error.
The algorithm is described below.

a) Let the length of the data array x(n) be N samples. A
moving average of the raw data is computed over a
rectangular window of length P samples.

b) The moving average at point k is computed by placing
the window centrally over the data sample at k. The
moving average y(k) at point k is given by

$$ y(k) = \frac{1}{P} \sum_{i=0}^{P-1} x\!\left(k - \frac{P-1}{2} + i\right), $$

assuming P is odd.

c) At the two end regions, the moving average is
computed as in step b, but the values outside
the data segment are set to zero.

d) The moving average y(k), k = 0, ..., N - 1, is
smoothed using a non-linear median smoother
(Rabiner et al., 1975).

e) The corrected data sample x'(n) is obtained as

$$ x'(n) = x(n) - y(n) $$

The raw data with a trend and the corrected data
are shown in Figure 6.3. A good value for the
window length P is one pitch period.

Figure 6.3. Illustration of trend removal from the EGG
data. a) Raw data. b) After removing the trend.

iii) Correction for low frequency phase distortion
introduced by the tape recorder was applied when
necessary (Berouti et al., 1977; Krishnamurthy, 1983).

iv) Linear phase FIR filtering was applied to remove the
60 Hz power component when necessary.
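
The first two corrections can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the function names, the five-point width of the running median and the synthetic test signal are hypothetical choices, and the median smoother is a plain running median rather than the specific smoother of Rabiner et al. (1975).

```python
import math
import statistics


def propagation_delay(mic_distance_cm, tract_length_cm,
                      fs_hz=10000.0, sound_speed_m_s=344.0):
    """Return (delay in seconds, delay in whole samples)."""
    path_m = (mic_distance_cm + tract_length_cm) / 100.0
    delay_s = path_m / sound_speed_m_s
    return delay_s, round(delay_s * fs_hz)


def advance(signal, n_samples):
    """Shift a signal earlier by n_samples, padding the end with zeros."""
    return list(signal[n_samples:]) + [0.0] * n_samples


def moving_average(x, p):
    """Centred moving average over an odd window p; zeros outside the data."""
    if p % 2 == 0:
        raise ValueError("window length p must be odd")
    half = (p - 1) // 2
    y = []
    for k in range(len(x)):
        total = 0.0
        for i in range(p):
            j = k - half + i
            if 0 <= j < len(x):
                total += x[j]
        y.append(total / p)
    return y


def median_smooth(y, width=5):
    """Running median; the window is shortened at the two ends."""
    half = width // 2
    return [statistics.median(y[max(0, k - half):k + half + 1])
            for k in range(len(y))]


def remove_trend(x, p, median_width=5):
    """Subtract the median-smoothed moving average from the data."""
    trend = median_smooth(moving_average(x, p), median_width)
    return [xn - tn for xn, tn in zip(x, trend)]


delay_s, delay_samples = propagation_delay(15.0, 17.0)   # male subject

# Hypothetical test signal: a 500 Hz sine at 10 kHz on a slow linear drift.
raw = [math.sin(2 * math.pi * n / 20) + 0.01 * n for n in range(400)]
corrected = remove_trend(raw, p=21)
```

For the male subject the 32 cm path gives about 0.93 ms, or 9 samples at 10 kHz, matching the figures quoted in item i). The window length of 21 samples is the nearest odd value to the 20-sample period of the test signal.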


CHAPTER  7 
SPEECH  SYNTHESIS 


In  this  chapter  we  will  discuss  the  various  synthesis 
experiments  that  were  conducted  to  study  the  production  of 
natural  sounding  synthetic  speech  using  a linear  prediction 
synthesizer.  We  will  describe  the  analysis  conditions  and 
excitation  models  for  the  male,  female  and  child  cases.  A 
preliminary  assessment  of  naturalness  was  made  by  the  author 
subjectively,  in  each  case.  Three  of  the  synthesis  schemes 
were  then  selected  for  listener  evaluation,  which  will  be 
discussed  in  the  next  chapter.  In  the  following  sections, 
we  will  discuss  the  details  of  synthesis  implementation  and 
the  factors  contributing  to  the  naturalness  of  synthesized 
speech. 


7.1 Basic Synthesis Scheme

The basic linear predictive synthesis scheme is shown
in Figure 7.1. It is derived directly from the analysis
model discussed in Chapter 3.

We have used the direct form implementation of the
synthesizer. The synthesized speech s(n) is mathematically
defined as

$$ s(n) = -\sum_{k=1}^{P} a_k\, s(n-k) + G\, u(n) \tag{7.1} $$

where
{a_k}    LP coefficients
s(n-k)   k-th previous sample of speech
G        Gain factor
u(n)     Excitation input.

The  goal  of  our  synthesis  experiments  is  to  obtain 
these  parameters  as  reliably  as  possible,  to  produce  natural 
sounding  synthesized  speech. 
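
A minimal sketch of the direct form recursion follows: each output sample is the gain-scaled excitation minus a weighted sum of the previous output samples. It assumes zero initial conditions and a single fixed set of coefficients, whereas the synthesizer in this study updates the coefficients and gain from frame to frame. The function name and the first-order example values are hypothetical.

```python
def lp_synthesize(a, gain, u):
    """Direct-form all-pole synthesis with zero initial conditions.

    a[k-1] holds a_k for k = 1..P, with the sign convention
    s(n) = -sum_k a_k s(n-k) + G u(n).
    """
    s = []
    for n in range(len(u)):
        acc = gain * u[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * s[n - k]
        s.append(acc)
    return s


# Hypothetical first-order example: a_1 = -0.5, so s(n) = 0.5 s(n-1) + u(n).
impulse = [1.0, 0.0, 0.0, 0.0]
response = lp_synthesize([-0.5], 1.0, impulse)
print(response)  # [1.0, 0.5, 0.25, 0.125]
```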

7.2 Parameter Specifications

In our LP synthesizer, each of the control parameters
mentioned above can be individually specified. This enables
us to systematically vary these parameters to assess their
relative importance in producing natural sounding speech.

In this section we will list all the parameters that
were varied. In the next three sections, we will discuss
the results of synthesis examples for the male, female and
child data sets.

7.2.1 Prediction Coefficients

The  linear  predictive  coefficients  were  computed  from 
the  speech  signal  according  to  the  analysis  algorithms 
discussed  in  Chapter  3.  The  analysis  frame  was  varied  in 
three  ways. 

a) Pitch asynchronous, fixed frames of 20 ms duration were
used. The successive frames were shifted every 10 ms.
Thus, two adjacent frames overlapped over a 10 ms
duration as in Figure 7.2.a. Both the autocorrelation
and the covariance methods of analysis were studied.
The data segment was windowed using a Hanning window for
the autocorrelation method and a rectangular window for
the covariance method. The speech samples were
preemphasized with a filter of transfer function
$H(z) = 1 - a z^{-1}$, where a = 0.95.

b) The analysis frame was exactly one period long, as in
Figure 7.2.b. The periods were obtained from the time
aligned EGG signal, as described in algorithm 4.2.2 of
Chapter 4. The covariance method of analysis was used,
with the same preemphasis filter as in case (a) above.
A rectangular window was employed. The initial
condition values were obtained from the previous period
as in Figure 7.2.b. The analysis frame was shifted by
exactly one period and no overlapping was allowed.

c) The analysis frame was still pitch synchronous as in
case b, but the analysis was performed over the glottal
closed phase only. The closed phase was obtained from
the aligned EGG signal, according to algorithm 4.3.2 of
Chapter 4. The covariance method of analysis was used
with a rectangular window and no preemphasis. The
initial condition values were set as in case b, above.
The frame location is shown in Figure 7.2.c.

Figure 7.2. Frame size for three analysis conditions.
a) Pitch asynchronous, fixed frame analysis.
b) Pitch synchronous analysis over one pitch period.
I.C. - initial condition.
c) Pitch synchronous analysis over closed phase only.
I.C. - initial condition.
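
The framing for cases (a) and (b) can be sketched as follows. This is an illustration only: it covers preemphasis, windowing and frame selection, not the autocorrelation or covariance analysis itself, and it omits the initial condition samples used in case (b). Applying preemphasis before windowing, the function names, the placeholder signal and the period boundaries are assumptions, not details from the text.

```python
import math


def preemphasize(x, a=0.95):
    """Apply H(z) = 1 - a z^-1, taking the sample before x[0] as zero."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


def hanning(length):
    """Hanning window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def fixed_frames(x, fs_hz=10000, frame_ms=20, shift_ms=10):
    """Case (a): fixed 20 ms frames shifted every 10 ms."""
    size = fs_hz * frame_ms // 1000
    shift = fs_hz * shift_ms // 1000
    return [x[i:i + size] for i in range(0, len(x) - size + 1, shift)]


def period_frames(x, period_starts):
    """Case (b): one frame per pitch period, no overlap.

    period_starts are sample indices of successive period boundaries,
    assumed here to come from the EGG-based algorithm.
    """
    return [x[b:e] for b, e in zip(period_starts, period_starts[1:])]


speech = [math.sin(0.3 * n) for n in range(500)]          # placeholder data
emphasized = preemphasize(speech)
frames_a = [[w * v for w, v in zip(hanning(len(f)), f)]
            for f in fixed_frames(emphasized)]
frames_b = period_frames(emphasized, [0, 80, 165, 240])   # hypothetical marks
```

At a 10 kHz sampling rate the fixed frames are 200 samples long with a 100-sample shift, while the pitch synchronous frames take whatever length each period has.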

7.2.2  Pitch  Contour 

The  pitch  contour  is  an  array  containing  the 
voiced/unvoiced/silence  decisions  for  a long  segment  of 
speech.  The  unvoiced  and  silence  portions  are  set  to  have  a 
fixed  duration,  for  example,  10  ms.  The  voiced  portions 
have  an  entry  in  the  array  specifying  the  duration  of  each 
period.  The  pitch  periods  are  obtained  in  two  ways. 

a) Pitch contour is derived from the speech signal. The LP
error signal is used to compute the pitch period
(algorithms 3.4.1 and 3.4.2 in Chapter 3).

b) Pitch contour is derived from the EGG signal, which is
time aligned with the speech signal (algorithm 4.2.2 of
Chapter 4).

The pitch contours from the above two methods are
plotted in Figure 7.3. The fundamental frequency contour
derived from the speech signal is inaccurate for mixed
excitation portions of the speech signal, since the
autocorrelation function of the error signal does not
contain a prominent peak at the pitch period. The same is
true for sounds such as /w/ with a strong first formant and
very weak higher formants, as discussed in Section 3.4 of
Chapter 3. But the EGG based algorithm is not affected by
these supraglottal events and accurately measures F0 if
glottal vibration is present.

Figure 7.3. Comparison of pitch contours derived from
speech (dotted line) and EGG (solid line),
for the sentence, "We were away a year ago".
Subject: JMN (male).

7.2.3  Gain  Contour 

The gain term for each analysis frame is obtained by
matching the energy of the synthesized speech over each
period to the energy of the original speech. Under this
condition, the gain term is computed from the prediction
residual signal during analysis. For the fixed frame
analysis case, the energy is obtained over 20 ms and
interpolated to obtain the gain per period. This results in
a smooth contour. But in the pitch synchronous analysis,
the gain term is computed over exactly one period and thus
the contour contains the rapid changes from period to
period. Figure 7.4 illustrates this point. Hence the pitch
synchronous gain contour is better than the fixed frame gain
contour. Pitch synchronous gain resulted in a strong
sounding synthesis, whereas in the fixed frame synthesis,
the smoothed gain resulted in a weak sounding synthesis.

Figure 7.4. Gain contours for two analysis methods
(vertical axis: normalized gain).

For unvoiced sounds, the gain term is obtained over a
constant duration, for example, 10 ms, in both the pitch
asynchronous and pitch synchronous cases.
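
The energy matching condition can be sketched as a scale factor applied to one synthesized period. This is a simplified illustration: in the study the gain is computed from the prediction residual during analysis, which is not reproduced here, and the function names and sample values are hypothetical.

```python
import math


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def match_energy(synth_period, orig_period):
    """Scale one synthesized period to the energy of the original period."""
    e_synth = energy(synth_period)
    if e_synth == 0.0:
        return list(synth_period)       # nothing to scale
    g = math.sqrt(energy(orig_period) / e_synth)
    return [g * v for v in synth_period]


original = [1.0, 2.0, -1.0]             # hypothetical period of natural speech
synthetic = [0.5, -0.25, 0.1]           # hypothetical period before scaling
scaled = match_energy(synthetic, original)
```

Applying the same function period by period gives a contour that follows period-to-period energy changes, which is the behaviour attributed above to the pitch synchronous case.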


7.2.4  Excitation  Signal 

The  different  excitation  models  were  the  subject  of 
Chapter  5.  Now,  we  will  describe  how  these  models  are 
generated  such  that  they  reflect  the  dynamic  variations 
taking  place  at  the  laryngeal  level.  We  have  considered 
three  voiced  excitation  types. 

a)  Impulse  excitation. 

b)  Fant's  excitation. 

c)  Differentiated  EGG  excitation. 

The unvoiced excitation was a random number sequence
distributed uniformly between -1 and 1. Mixed excitation
was generated as a combination of voiced and unvoiced
excitations. Mixed excitation regions were detected when
the EGG signal indicated voicing but the speech signal had
a low amplitude and a high zero crossing rate, indicating
unvoicing, as in Figure 7.5. Although the presence of mixed
excitation can be easily detected, there are no reliable
algorithms to detect the degree of voicing and unvoicing to
generate a mixed excitation for the synthesizer. After
considering different weighting factors such as the EGG
amplitude, autocorrelation peak and zero crossing rate, we
used a constant mixing factor. The voiced excitation
amplitude
was set at twice the amplitude of the unvoiced excitation.
Further, the gain matching of synthetic and original speech
signals ensures correct amplitude for the synthesized
output.

Figure 7.5. Region of mixed excitation.
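
A constant mixing factor of this kind can be sketched as below. The sketch assumes that "amplitude" means peak magnitude, so the voiced part is scaled to a peak of 2 while the uniform noise peaks at 1; this reading, the function names and the example pulse are assumptions, not details from the text.

```python
import random


def unvoiced_excitation(n, rng):
    """Random sequence distributed uniformly between -1 and 1."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def mixed_excitation(voiced, rng, voiced_to_unvoiced=2.0):
    """Add noise to a voiced excitation with a constant mixing factor.

    Assumption: "amplitude" is read as peak magnitude, so the voiced part
    is scaled to a peak of voiced_to_unvoiced and the noise peaks at 1.
    """
    peak = max(abs(v) for v in voiced) or 1.0
    noise = unvoiced_excitation(len(voiced), rng)
    return [voiced_to_unvoiced * v / peak + w for v, w in zip(voiced, noise)]


rng = random.Random(0)                      # fixed seed for repeatability
voiced = [0.0, -1.0, 0.0, 0.0, 0.25, 0.0]   # hypothetical one-period pulse
mixed = mixed_excitation(voiced, rng)
```

The overall level of the mixed excitation is not critical in this sketch, since the gain matching step sets the final output amplitude.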

We  will  now  describe  how  the  information  obtained  from 
the  EGG  signal  is  used  to  generate  the  effective  excitation 
for  the  synthesizer. 

The  aligned  EGG  signal  gives  information  about  i)  the 
duration  of  the  pitch  period,  ii)  the  glottal  closed  phase 
duration  and  iii)  the  points  of  glottal  closure  and  opening, 
as  described  in  Chapter  4.  In  Figure  7.6,  the  upper  plot  is 
the  contour  of  the  pitch  period  duration  and  the  lower  plot 
is  a contour  of  the  duration  of  closed  phase,  both  derived 
from  the  EGG  signal,  for  the  sentence  /we  were  away  a year 
ago/.  This  information  is  used  to  generate  the  excitation 
signal.  An  impulse  excitation  was  generated  in  two  ways, 
a)  an  impulse  of  amplitude  equal  to  -1  at  closure  and  b)  an 
impulse  of  amplitude  equal  to  -1  at  closure  and  another 
impulse  of  amplitude  equal  to  0.25  at  opening,  as 
illustrated  in  Figure  7.7. 
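
The two impulse excitations can be sketched as follows, using the amplitudes given above (-1 at closure, 0.25 at opening). The closure and opening instants are assumed to be available as sample indices from the EGG signal; the function name and the example indices are hypothetical.

```python
def impulse_excitation(n_samples, closures, openings=None,
                       closure_amp=-1.0, opening_amp=0.25):
    """Impulses at glottal closure and, optionally, at glottal opening.

    closures and openings are sample indices, assumed to come from the
    EGG-based detection of Chapter 4.
    """
    u = [0.0] * n_samples
    for c in closures:
        u[c] = closure_amp
    for o in openings or []:
        u[o] = opening_amp
    return u


# Hypothetical instants for three periods at a 10 kHz sampling rate.
closures = [10, 90, 172]
openings = [55, 136, 217]
u_a = impulse_excitation(250, closures)             # case (a)
u_b = impulse_excitation(250, closures, openings)   # case (b)
```

Because the impulse positions come directly from the closure instants, the spacing between impulses follows the period-to-period variation of the pitch period.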

For Fant's excitation, the duration of each period and
closed phase were derived from the EGG signal. The opening
phase (T1) was fixed at 5/7 the duration of the open phase
(T2), since this information is not available from the EGG
signal. Fant's excitation signal was differentiated once to
flatten the excitation spectrum. The resulting waveforms
are shown in Figure 7.8. The third voiced excitation was
obtained directly from the EGG signal by differentiating the
actual EGG signal that is in alignment with the speech
signal. This signal was smoothed with a nonlinear median
smoother when necessary. The resulting waveform is shown in
Figure 7.8.

Figure 7.6. Contours of the durations of pitch period and
glottal closed phase for the sentence, "We
were away a year ago". Speaker: JMN (male).

Figure 7.7. Excitation waveforms derived from the EGG
signal.

Figure 7.8. Excitation waveforms derived from the EGG
signal.

The above control parameters were used to obtain
synthesized speech for the male, female and child data sets.
The effects of varying these parameters on the synthesized
speech for each case are discussed in the next three
sections.


7.3 Synthesis of Male Speech

A typical segment of male speech is shown in
Figure 7.9, with the aligned EGG signal for the normal, male
subject JMN. The EGG was strong and noise free and thus
V/UV and F0 computation was very reliable. The speech
derived and EGG derived F0 contours are the same as in
Figure 7.3. The average F0 for the test sentence /we were
away a year ago/ was 125 Hz. The subject had a long closed
phase in this F0 range, which was confirmed from high speed
laryngeal films of the subject, obtained for sustained
vowels (Childers et al., 1983), as in Figure 7.10(a) and
(b). Although two subjects may produce EGG signals of
similar shapes, they may not possess equal durations of
glottal closure. The reason is that the EGG signal
represents relative changes in impedance across the vocal
folds and minimum impedance may not necessarily correspond
to complete closure. Thus, confirmation of a long closed
phase for this subject made closed phase analysis
theoretically valid.

Figure 7.9. A typical segment of speech and EGG data
for subject JMN.

The LP analysis was performed for the
three conditions described in Section 7.2.1, for the
sentence /we were away a year ago/. They were,

i)  Fixed  frame  autocorrelation  method. 

ii)  Pitch  synchronous  covariance  method  over  exactly  one 
pitch  period. 

iii)  Pitch  synchronous  covariance  method  over  closed  phase 
only. 

In  each  case,  the  three  excitations  used  to  derive  the  LP 
synthesizer  were, 

i)  impulse  excitation  at  closure  and  opening. 

ii)  Fant's  excitation  (differentiated). 

iii)  Differentiated  EGG  waveform. 

Figure 7.10. Synchronized plots of glottal area, EGG and
speech waveforms for a male subject (JMN).
a) Glottal area and EGG waveform.
b) Glottal area, EGG and speech waveforms.

Each of the excitation waveforms was generated using
two pitch contours, as follows.

i) Contour derived from the speech signal.

ii) Contour derived from the aligned EGG signal.
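The three analysis methods, three excitations and two pitch contours listed above form a full factorial design. A minimal Python sketch of that enumeration follows; the label strings are illustrative only and are not identifiers from the original software.

```python
from itertools import product

# Labels are illustrative; they paraphrase the three lists in the text.
analyses = [
    "fixed frame autocorrelation",
    "pitch synchronous covariance, one pitch period",
    "pitch synchronous covariance, closed phase only",
]
excitations = [
    "impulses at closure and opening",
    "Fant's excitation (differentiated)",
    "differentiated EGG waveform",
]
pitch_contours = ["speech derived", "EGG derived"]

# Every combination of one analysis, one excitation and one pitch contour.
conditions = list(product(analyses, excitations, pitch_contours))

for index, (analysis, excitation, contour) in enumerate(conditions, start=1):
    print(f"{index:2d}. {analysis} | {excitation} | {contour} pitch contour")
```

Varying one factor while holding the other two fixed corresponds to moving along one axis of this product.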

Thus a total of eighteen synthesis conditions were
studied for each subject, by varying one parameter while
keeping the others the same. Each synthesized sentence was
played  back  over  both  headphones  and  speakers  in  a sound 
treated  room.  A preliminary  assessment  of  the  naturalness 
of  the  synthesized  speech  was  made  by  the  author  in  each 
case.  The  findings  were  as  follows. 

i) The fixed frame analysis resulted in an incorrect
spectral representation for the stop /g/ and
the initial onset of /w/ because of spectral averaging
over 20 msec long frames.

ii)  Pitch  synchronous  analysis  followed  the  rapid  changes 
in  the  vocal  tract  spectrum  and  the  transition  for  /g/ 
was  smooth.  The  onset  of  /w/  was  also  heard  more 
clearly  than  in  fixed  frame  analysis. 

iii) The errors in voicing introduced by the speech derived
pitch contour were clearly audible. The voiced
segments incorrectly labeled as unvoiced
produced a severe degradation in naturalness.

iv) In contrast, the pitch contour derived from the EGG
signal had no V/UV errors. Further, the contour
followed the period-to-period changes in the pitch
period duration (also called "jitter") and thus
enhanced the naturalness of the synthesized speech.

v)  Impulse  excitation  at  closure  and  opening  produced 
better  synthesis  than  an  impulse  excitation  at  closure 
only,  for  the  same  set  of  predictor  coefficients. 

vi)  Fant's  excitation  resulted  in  a "bassy"  sounding 
synthesis  because  of  its  low  high-frequency  content. 

vii)  Differentiated  EGG  excitation  resulted  in  a much 
"crisper"  synthesis  because  of  its  flatter  spectrum. 

viii)  For  Fant's  excitation  waveform,  specifying  the  closure 
duration  from  the  corresponding  EGG  signal  resulted  in 
a smoother  synthesis  than  a fixed  duration  for  closed 
phase.  In  other  words,  incorporating  the  details  of 
the  vocal  fold  vibration  improved  the  synthesis. 

ix) Pitch synchronous analysis over the closed phase only
resulted in a loss of naturalness. The damping during
the closed phase is less than when the glottis is open.
This results in smaller formant bandwidths for closed
phase analysis than for analysis over the whole period.
Although the formant frequencies are estimated
accurately by closed phase analysis, the spectral
matching over the entire spectrum is poor, resulting
in a poor quality synthesis.
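The link between damping and bandwidth noted in item ix can be illustrated with the standard relation between a pole of the LP filter at radius r and angle theta and the formant it represents: F = theta * fs / (2 * pi) and B = -(fs / pi) * ln(r). The sketch below is only an illustration; the 10 kHz sampling rate and the two pole radii are assumed values, not measurements from this study.

```python
import math

def pole_to_formant(radius, angle_rad, fs_hz):
    """Map one pole of an all-pole filter to (formant frequency, bandwidth) in Hz."""
    frequency_hz = angle_rad * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -(fs_hz / math.pi) * math.log(radius)
    return frequency_hz, bandwidth_hz

fs_hz = 10_000.0                         # assumed sampling rate
angle = 2.0 * math.pi * 500.0 / fs_hz    # pole angle for a 500 Hz formant

# Assumed radii: less damping (closed phase) puts the pole nearer the unit circle.
f_closed, bw_closed = pole_to_formant(0.99, angle, fs_hz)
f_whole, bw_whole = pole_to_formant(0.96, angle, fs_hz)

print(f"closed phase : F = {f_closed:.0f} Hz, B = {bw_closed:.0f} Hz")
print(f"whole period : F = {f_whole:.0f} Hz, B = {bw_whole:.0f} Hz")
```

With the same pole angle, the formant frequency is unchanged while the bandwidth shrinks as the radius approaches one, which is the behaviour described for closed phase analysis.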

Figure 7.11 illustrates synthesis gain matching for a
long segment of speech. The time waveforms of one natural
and three synthesized speech signals are plotted in
Figure 7.12. The corresponding spectrograms are shown in
Figure 7.13.

Figure 7.11. A long segment of original and synthesized
speech, illustrating synthesis gain matching.
(Vertical axes: amplitude.)

In summary, pitch synchronous covariance analysis over
one whole period resulted in the best synthesis. The
differentiated EGG waveform and Fant's excitation
waveform produced synthesis of acceptable naturalness. The
standard LP synthesis with fixed frame analysis and impulse
excitation resulted in a "buzzy" synthesis. These three
synthesized sentences were presented to two groups of
listeners for a formal subjective evaluation, which is
discussed in the next chapter.

Figure 7.12. Comparison of original and synthesized
speech waveforms for the female subject (DDL).
a) 1 - Original speech.
   2 - Pitch synchronous analysis, EGG excitation.
b) 3 - Pitch synchronous analysis, Fant's excitation.
   4 - Fixed frame analysis.

Figure 7.13. Speech spectrograms for the sentences used in
this study.
d) Fixed frame analysis, impulse excitation.

7.4 Synthesis of Female Speech

Figure 7.14. A typical segment of speech and EGG data
for subject DDL. (Vertical axis: normalized amplitude.)

A typical segment of speech and EGG waveforms for
female speech (DDL) is plotted in Figure 7.14. The average
fundamental frequency was 200 Hz. The EGG signal still has
a well defined "knee" corresponding to glottal closure.
However, for the high F0 range as in this example, complete
glottal closure may not have taken place. The sharp fall in
the EGG signal may just correspond to maximum glottal
contact. Further, the shape of the EGG signal is very
similar to the EGG signal for a male subject (JMN) at a
fundamental frequency of 340 Hz. There was no glottal
closure at this F0 range for that subject, as confirmed from
high speed laryngeal films (Childers et al. 1983a). Thus
glottal excitation might still be present during maximum
vocal  fold  contact.  Hence,  closed  phase  analysis  cannot  be 
theoretically  justified.  If  indeed  glottal  closure  is 
present,  its  duration  is  very  short  at  high  fundamental 
frequencies.  Hence,  the  analysis  frame  for  the  covariance 
method  cannot  be  placed  completely  within  the  closed  phase 
region. The reason is that a predictor order of p requires
at least (p + 1) data samples (and p initial condition
values) for coefficient computation, and an order of 14 may
be too large. A reduction in prediction order will give a
poorer  spectral  matching.  But  pitch  synchronous  analysis 
over  each  period  can  still  be  performed  adequately.  The 
pitch  contour  for  the  test  sentence,  derived  from  the  EGG 
signal  is  shown  in  Figure  7.15.  The  contour  of  closure 
duration  is  based  on  the  change  in  slope  in  the  EGG  signal 
when  the  vocal  folds  separate.  It  may  not  represent  true 
closed  phase  as  discussed  earlier.  The  LP  analysis  was 
performed  for  two  cases,  a)  fixed  frame  autocorrelation 
method  and  b)  pitch  synchronous  covariance  method  over  the 
entire  pitch  period.  Figure  7.16  shows  the  excitation 
waveforms  derived  from  the  EGG  signal.  These  waveforms  were 
used  to  drive  the  synthesizer  as  in  the  case  of  male  speech. 
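The sample-count argument above can be checked with a few lines of arithmetic. The following Python sketch assumes a 10 kHz sampling rate and a closed phase lasting 30% of the pitch period; both values are hypothetical and serve only to illustrate the constraint. The order of 14, the 200 Hz fundamental and the 20 msec fixed frame come from the text.

```python
def min_covariance_samples(order):
    """Samples spanned by one covariance frame: p initial-condition
    values followed by at least p + 1 data samples."""
    return order + (order + 1)

def periods_per_frame(frame_ms, f0_hz):
    """Number of pitch periods covered by a fixed analysis frame."""
    return frame_ms * f0_hz / 1000.0

fs_hz = 10_000.0          # assumed sampling rate
order = 14                # predictor order from the text
f0_hz = 200.0             # average F0 of the female speaker
closed_fraction = 0.3     # assumed closed phase fraction of the period

period_samples = fs_hz / f0_hz
closed_phase_samples = closed_fraction * period_samples
needed_samples = min_covariance_samples(order)

print(f"samples per pitch period : {period_samples:.0f}")
print(f"samples in closed phase  : {closed_phase_samples:.0f}")
print(f"samples needed (p = {order})  : {needed_samples}")
print(f"periods in a 20 msec frame at 200 Hz: {periods_per_frame(20.0, 200.0):.0f}")
print(f"periods in a 20 msec frame at 250 Hz: {periods_per_frame(20.0, 250.0):.0f}")
```

Under these assumptions the closed phase is shorter than the span a 14th order covariance frame requires, while a whole pitch period is long enough. The same arithmetic shows that a 20 msec fixed frame covers four to five pitch periods at 200 to 250 Hz.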

Figure 7.15. Contours of the durations of pitch period
and glottal closed phase for the sentence,
"We were away a year ago". Speaker: DDL (female).

Figure 7.16. Excitation waveforms derived from the EGG
signal.
a) Two types of impulse excitations.
b) Fant's and differentiated EGG excitations.

The results of synthesis were as follows.

i) Fixed frame analysis resulted in gain values averaged
over four to five pitch periods. This resulted in a
"weak sounding" synthesis, since the dynamic changes
in the amplitude were not accurately tracked by the
gain contour.

ii) The F0 contour obtained from the EGG signal was very
reliable. Portions of the contour were compared with
F0 values obtained manually from the EGG signal to
confirm its accuracy.

iii)  The  pitch  synchronous  analysis  produced  more  natural 
sounding  synthesis  than  the  fixed  frame  analysis. 

iv) The differentiated EGG was a better excitation than
Fant's excitation. Because of the higher F0, the
latter did not sound as "bassy" as the corresponding
synthesis for male speech.

v)  Pitch  synchronous  synthesis  reproduced  many  details  of 
the  original  speech  such  as  breath  noise,  glottal 
attack  at  the  beginning  of  /a/  after  silence,  etc. 
This  further  justifies  the  use  of  pitch  synchronous 
analysis  and  synthesis. 
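Once the glottal closure instants have been located in the EGG signal, the F0 contour of item ii follows from the spacing between consecutive instants. The Python sketch below assumes the closure instants are already available as a list of times in seconds (the values shown are made up). It also computes jitter as the mean absolute difference between consecutive periods relative to the mean period, which is one common definition and not necessarily the one used in this study.

```python
def periods_and_f0(closure_times_s):
    """Pitch period durations and F0 values from consecutive closure instants."""
    periods = [t1 - t0 for t0, t1 in zip(closure_times_s, closure_times_s[1:])]
    f0 = [1.0 / period for period in periods]
    return periods, f0

def jitter_percent(periods):
    """Mean absolute period-to-period change, as a percentage of the mean period."""
    changes = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_change = sum(changes) / len(changes)
    mean_period = sum(periods) / len(periods)
    return 100.0 * mean_change / mean_period

# Hypothetical closure instants (seconds) around a 200 Hz fundamental.
closure_times = [0.0, 0.0050, 0.0101, 0.0150, 0.0201]

periods, f0 = periods_and_f0(closure_times)
print("F0 per period (Hz):", [round(value, 1) for value in f0])
print("jitter (%):", round(jitter_percent(periods), 2))
```

Because each F0 value is tied to one glottal cycle, a contour computed this way follows period-to-period variation that a fixed frame estimate would average out.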

The time waveforms of the original and three
synthesized cases are shown in Figure 7.17.
Figure 7.18 illustrates gain matching between original
and synthetic speech for a long segment of the test
sentence. The original and three synthesized
sentences, for the same conditions as for male speech,
were used in listener evaluation sessions, discussed
in the next chapter.

Figure 7.17. Comparison of original and synthesized
speech waveforms for the child subject (BJT).
a) 1 - Original speech.
   2 - Pitch synchronous analysis, EGG excitation.
b) 3 - Pitch synchronous analysis, Fant's excitation.
   4 - Fixed frame analysis, impulse excitation.

Figure 7.18. A long segment of original and synthesized
speech, illustrating synthesis gain matching.

7.5 Synthesis of Child Speech

The test sentence /we were away a year ago/ in the
child's case (BJT) had two distinguishing features, viz., a
high fundamental frequency of 250 Hz, and a strong nasality
throughout the sentence. A short segment of speech and a
time aligned EGG waveform are shown in Figure 7.19. The EGG
signal still has a sharp slope at closure, but because of
the high fundamental frequency, true glottal closure may not
be present. Since no laryngeal films were available in this
case, closed phase analysis was not used for synthesis. But
pitch synchronous analysis over each period and fixed frame
analysis were performed as earlier. The pitch contour and
the excitation waveforms are shown in Figures 7.20 and 7.21
respectively. Because of the short pitch periods, proper
placement of pitch synchronous frames was crucial.
Placement errors resulted in clicks and degradation in
overall naturalness. The results of the synthesis procedure
for the different excitations and coefficient analysis
conditions were as follows.

i) Pitch synchronous analysis was better than fixed frame
analysis, as before.

ii) The synthesized sentences had a distinct loss of
nasality, compared with the original speech. This is
an obvious weakness of the all-pole LP model. The
nasal sounds contain strong spectral valleys, which
are not accurately represented in the LP spectrum.

iii) The speech was generally perceived as "buzzy". But
the temporal details of the original and synthetic
segments were very similar, as seen in Figures 7.22 and
7.23. One possible explanation is seen from Figure
7.24, which has the LP model spectra for the vowels
/a/ and /i/ for male, female and child subjects used
in this study. The higher formants for the child's
speech are shifted upwards in frequency compared to
those for the male and female cases. They are also
higher in magnitude. The presence of these strong
formants was perhaps perceived as "buzzy" or "crisp."

iv) The differentiated EGG excitation was a better
excitation function than Fant's or the impulse
excitations. Due to the high F0 for the child's
speech, Fant's excitation resulted in less "bassy"
speech.

Figure 7.19. A typical segment of speech and EGG data
for subject BJT.

Figure 7.20. Contours of the durations of pitch period
and glottal closed phase for the sentence,
"We were away a year ago". Subject: BJT (child).

Figure 7.21. Excitation waveforms derived from the EGG
signal.
a) Two types of impulse excitations.
b) Fant's and differentiated EGG excitations.

The synthesized sentences were evaluated by a formal
listening test, which is the topic of the next chapter.
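The loss of nasality noted in item ii can be seen from the form of the transfer functions. The block below is a sketch in a commonly used notation; the symbols G, a_k, b_m, p and q and the sign convention of the denominator are assumptions made here and may differ from those used in the earlier chapters.

```latex
% All-pole LP synthesis filter: poles only, so it can model spectral peaks.
H_{\mathrm{LP}}(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}

% Nasalized speech couples the nasal tract to the vocal tract, which
% introduces zeros (anti-resonances) in addition to the poles:
H_{\mathrm{nasal}}(z) = \frac{G \left( 1 - \sum_{m=1}^{q} b_m z^{-m} \right)}
                             {1 - \sum_{k=1}^{p} a_k z^{-k}}
```

The numerator of the second expression produces the spectral valleys. An all-pole filter of moderate order has no corresponding term and can only approximate a valley with additional poles, so the valleys are poorly matched in the LP spectrum.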

7.6 Summary

Speech synthesis experiments for different analysis
conditions and excitation waveforms were discussed. The EGG
signal was shown to be a valuable tool for pitch synchronous
analysis-synthesis and for generating synthesizer excitation
signals that correspond to important events in the glottal
vibratory cycle. The synthesis techniques were shown to be
applicable to male, female and child subjects representing
three ranges of the fundamental frequency.

Figure 7.22. Comparison of original and synthesized
waveforms.
a) 1 - Original speech.
   2 - Pitch synchronous analysis, EGG excitation.
b) 3 - Pitch synchronous analysis, Fant's excitation.
   4 - Fixed frame analysis, impulse excitation.

Figure 7.23. A long segment of original and synthesized
speech, illustrating synthesis gain matching.

Figure 7.24. Comparison of vowel spectra for male, female
and child subjects used in this study.
a) Spectra for the vowel /a/.


CHAPTER  8 

EVALUATION OF SYNTHESIZED SPEECH

8.1 Introduction

In  Chapter  2,  the  various  issues  involved  in  the 
subjective  and  objective  evaluation  of  speech  attributes 
were  discussed.  It  was  pointed  out  that  subjective  rating 
continues  to  be  the  most  widely  accepted  method  of  speech 
evaluation.  Some  of  the  objective  evaluation  schemes 
proposed  in  the  literature  were  reviewed.  These  distance 
measures  have  been  employed  in  measuring  distortions  due  to 
coding,  transmission,  filtering,  etc.  But  these  measures 
are  ineffective  in  measuring  the  "naturalness"  of 
synthesized  speech.  The  reason  is  that  the  model  parameters 
on  which  these  distance  measures  are  based  do  not  adequately 
represent the original, natural speech. The speech
reconstructed from the model parameters is still
distinguishable from the original speech, and the
"naturalness," which was the quantity to be measured, has
been lost in the modeling process. Therefore, we focused on
the  question,  "how  can  we  obtain  a reliable  representation 
of  the  speech  signal  such  that  the  reconstructed  speech 
sounds natural?" Speech analysis and synthesis using the
linear prediction model were discussed in Chapters 3 through
7.  The  electroglottograph  signal  was  used  as  a source  of 
glottal  information,  to  guide  the  LP  analysis  and  synthesis 
for  different  excitation  models  and  analysis  conditions. 

In  the  absence  of  quantitative  measures  of  naturalness 
we  have  to  use  the  collective  judgment  of  a large  group  of 
listeners  to  discover  the  factors  responsible  for 
synthesizing  natural  sounding  speech.  The  subjective 
evaluation  yields  a rating  based  on  perceptually  defined 
quantities  rather  than  signal  characteristics  that  can  be 
measured.  But  by  presenting  the  synthesized  speech  material 
in  a systematic  manner,  we  can  identify  the  synthesis 
conditions  that  are  "preferred"  by  a group  of  listeners. 
This  chapter  is  a summary  of  the  listening  tests  that  were 
conducted  to  evaluate  the  acceptability  of  the  various 
synthesis  cases  described  in  Chapter  7. 


8.2 Evaluation Procedure

The  goal  of  subjective  evaluation  is  to  judge  the 
"naturalness"  of  a speech  stimulus.  In  the  context  of  our 
experiments,  naturalness  is  defined  as  "human  sounding".  To 
reiterate earlier discussions, we require that the speech
stimulus be fully intelligible. Recognizability of the
speaker is not a requirement since the speech should simply
sound "human."

8.2.1 Speech Material

The presentation material consisted of one original and
three synthesized sentences, as below.

i) Original speech.

ii) Synthesized speech - Pitch synchronous covariance LP
analysis and differentiated EGG signal as excitation.

iii) Synthesized speech - Pitch synchronous covariance LP
analysis and Fant's excitation.

iv) Synthesized speech - Fixed frame autocorrelation LP
analysis and impulse excitation.

Each of these four sentences was obtained for three
speakers, viz., a male, a female and a child.

8.2.2  Presentation  Format 

Our  objective  was  to  discover  which  of  the  above  three 
synthesis  conditions  were  judged  by  the  listener  population 
as  "natural  sounding."  So,  we  used  a paired  comparison 
format,  in  which  the  three  synthesized  sentences  were 
compared  with  the  original  sentence  and  with  each  other. 
The two sentences in each pair were distinct. There were
six possible pairs. Each pair was presented twice in forward
order and twice in reverse order in each session. The order
of presentation was randomized to eliminate listener biases
to the position of a sentence in each pair. Thus a total of
24 pairs of sentences were generated for each speaker.
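The count of 24 presentations per speaker can be reproduced with a short Python sketch. The random seed and variable names are illustrative assumptions; only the design itself (four sentences, six pairs, two occasions, both orders) comes from the text.

```python
from itertools import combinations
import random

sentences = [1, 2, 3, 4]   # 1 = original, 2 to 4 = synthesized

# Six unordered pairs of distinct sentences.
unordered_pairs = list(combinations(sentences, 2))

# Each pair appears on two occasions, in forward and in reverse order.
presentations = []
for a, b in unordered_pairs:
    for _occasion in range(2):
        presentations.append((a, b))
        presentations.append((b, a))

# Randomize the running order; the seed is arbitrary and only makes
# the example repeatable.
random.Random(0).shuffle(presentations)

occurrences = {s: sum(s in pair for pair in presentations) for s in sentences}
print(len(unordered_pairs), "pairs,", len(presentations), "presentations")
print("occurrences per sentence:", occurrences)
```

Each sentence takes part in twelve presentations, which is the maximum score a sentence can receive in one session.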

Each  pair  was  presented  to  the  listeners  in  the 
sequence  shown  in  Figure  8.1.  A tone  of  500  Hz  was 
presented  first  as  a cue.  This  was  followed  by  each  pair  of 
sentences  (A,B),  repeated  twice,  with  the  interstimulus 
intervals  as  in  Figure  8.1.  A four  second  pause  followed, 
for  the  listeners  to  mark  their  choice  on  an  evaluation 
form.  This  sequence  was  repeated  for  each  of  the  24  pairs 
of  sentences.  The  listeners  were  asked  to  mark  the  entry  in 
each  pair  that  sounded  more  natural  than  the  other.  No  ties 
were  allowed.  They  were  instructed  to  make  a comparative 
judgment  even  if  both  sounded  very  natural  or  unnatural. 
The  loudness  of  the  synthesized  sentences  was  not  uniform 
for  all  the  sentences  because  of  differences  in  the  computed 
LP  synthesis  gain  parameter.  But  the  listeners  were  asked 
to  judge  the  naturalness  despite  this  difference.  At  the 
end  of  the  test  session,  they  were  asked  to  provide  a short 
comment  on  what  criteria  they  used  to  judge  the  synthesis. 
The  listener  evaluation  form  with  the  instructions  is 
included  in  Appendix  B. 

The speech material was recorded on a Revox A-77 tape
recorder and presented via two loudspeakers in a quiet
room. The session was repeated for a smaller listener group
using headphones in a sound treated room. The differences
in evaluation scores between the two sets will be discussed
later.

[Figure 8.1. Presentation sequence for each pair of sentences.]


8.2.3  Listener  Group 

There  were  two  groups  of  listeners.  The  first  group 
consisted  of  twelve  senior  level  students  majoring  in  speech 
and  their  instructor.  These  listeners  had  no  experience 
with  synthetic  speech,  but  had  an  adequate  knowledge  of 
speech  and  its  attributes  in  general  and  understood  the 
purpose  of  the  test  well.  All  were  native  speakers  of 
English. The second group consisted of six graduate
students in electrical engineering engaged in speech
research and Professor D. G. Childers, who heads the speech
research group. Only two were native speakers of English.
This group was initially used as a trial group, but the
scores were found to be acceptable and in some ways very
informative.

The listeners were not individually screened, but a
consistency criterion was used to accept their ratings. If
a listener responded identically to the two occurrences of
the same sentence pair in both the forward (A,B) and reverse
(B,A) orders, then that response was said to be consistent.
A listener was accepted only if his or her average
consistency for each session equaled or exceeded 75%.

8.2.4 Scoring

The sentence in a pair that was judged more natural
sounding than the other was rated with a score of one, and
the other sentence was rated with a score of zero. The
total score for each sentence per session is its preference
rating. The maximum possible score was twelve, since each
sentence occurred twelve times a session.
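The consistency criterion of Section 8.2.3 and the scoring rule above can be sketched in Python as follows. The sketch adopts one possible reading of the criterion, namely that a sentence pair counts as consistent when the same sentence is chosen in all of its forward and reverse presentations, and that a listener is accepted when every session reaches the 75% threshold. The data layout and function names are hypothetical.

```python
from collections import defaultdict

def consistent_pair_percent(responses):
    """responses: list of ((first, second), chosen) for one session.
    A pair is consistent if every presentation of it, in either order,
    led to the same chosen sentence."""
    choices = defaultdict(set)
    for (first, second), chosen in responses:
        choices[frozenset((first, second))].add(chosen)
    consistent = sum(1 for picked in choices.values() if len(picked) == 1)
    return 100.0 * consistent / len(choices)

def is_accepted(session_percentages, threshold=75.0):
    """Accept a listener whose consistency meets the threshold in each session."""
    return all(value >= threshold for value in session_percentages)

def total_scores(responses, sentences):
    """One point to the sentence judged more natural, zero to the other."""
    scores = {s: 0 for s in sentences}
    for _pair, chosen in responses:
        scores[chosen] += 1
    return scores

# Small made-up example: pair (1, 2) is consistent, pair (1, 3) is not.
example = [
    ((1, 2), 1), ((2, 1), 1), ((1, 2), 1), ((2, 1), 1),
    ((1, 3), 1), ((3, 1), 3), ((1, 3), 1), ((3, 1), 1),
]
print(consistent_pair_percent(example))
print(total_scores(example, [1, 2, 3]))
```

In a full session each sentence appears in twelve presentations, so the total score returned for a sentence lies between zero and twelve.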

8.3 Results of Synthesis Evaluation

We  now  discuss  the  results  of  the  listening  tests  for 
each  listener  group  separately  because  of  the  differences  in 
their  background  and  familiarity  with  speech  synthesis.  The 
subjective  ratings  are  presented  for  the  male,  female  and 
child  subjects  separately.  Each  rating  is  expressed  in  the 
following  tables  as  a percentage,  representing  the  number  of 
times  each  stimulus  was  judged  to  be  more  natural  sounding, 
when compared with all the other stimuli. For example, a
rating of 60% means that the sentence in question was judged
the more natural of the pair in 60% of its presentations.
This rating indicates the average attitude of the group to
each sentence. Another rating, referred to as the relative
preference rating, represents the number of times sentence A
was preferred to sentence B, expressed as a percentage of
their joint occurrences in both forward (A,B) and reverse
(B,A) orders. For example, a relative preference rating of
60% for sentence A compared with sentence B means that out
of all the joint occurrences of the two sentences, A was
preferred to B 60% of the time. The statistics were computed
on a Data General Corporation NOVA-4.
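Both percentages defined above can be computed from the same list of responses. The Python sketch below uses a made-up listener who always prefers the lower-numbered sentence; the function names and data layout are hypothetical and do not describe the original NOVA-4 program.

```python
from itertools import permutations

def rating_percent(responses, sentences):
    """Percentage of its presentations in which each sentence was preferred."""
    wins = {s: 0 for s in sentences}
    shown = {s: 0 for s in sentences}
    for (first, second), chosen in responses:
        shown[first] += 1
        shown[second] += 1
        wins[chosen] += 1
    return {s: 100.0 * wins[s] / shown[s] for s in sentences}

def relative_preference_percent(responses, a, b):
    """Percentage of the joint occurrences of a and b in which a was preferred."""
    joint = [chosen for (first, second), chosen in responses
             if {first, second} == {a, b}]
    return 100.0 * joint.count(a) / len(joint)

sentences = [1, 2, 3, 4]
# 12 ordered pairs, each presented twice: 24 presentations per session.
# The made-up listener always picks the lower-numbered sentence.
responses = [((a, b), min(a, b))
             for a, b in permutations(sentences, 2)
             for _occasion in range(2)]

ratings = rating_percent(responses, sentences)
print({s: round(value, 1) for s, value in ratings.items()})
print(relative_preference_percent(responses, 1, 2))
```

Since every presentation awards exactly one point and each sentence takes part in half of the presentations, the four ratings of a session always add up to 200%, which gives a quick check on a column of ratings.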

8.3.1  Listener  Group  One 

This  group  was  comprised  of  students  majoring  in  speech 
as  mentioned  earlier.  Out  of  the  thirteen,  only  seven  had  a 
consistency  score  exceeding  75%.  Table  8.1  contains  their 
rating  for  the  four  sentences.  Tables  8.2  - 8.4  contain  the 
relative  preference  ratings  for  male,  female  and  child 
speakers  respectively.  The  ratings  were  averaged  across  all 
seven  listeners  in  each  case.  The  results  were  as  follows. 

1) Sentences 1, 2, 3, and 4 were rated in that order, on
the average.

2) Sentences 1 and 2 received very close ratings and were
preferred equally, for male and female speakers. So,
pitch synchronous analysis-synthesis with the
differentiated EGG as the excitation was on the
average preferred equally with the original speech.

3) The widest listener disagreement occurred between
sentences 2 and 3. These two sentences were both
generated with pitch synchronous analysis, and
differed in the synthesis excitation only.
Sentence 2, using the differentiated EGG as
excitation, produced a "crisp" synthesis, whereas
sentence 3 with Fant's excitation produced a
"bassy" synthesis. The listeners had sharp
disagreements in judging which of the two sounded more
natural.

4) Sentence 4, which was generated with fixed frame
analysis and impulse excitation, was consistently
ranked last. This justifies the use of pitch
synchronous analysis-synthesis and non-impulse
excitations.

5) The listeners' written comments yielded some
interesting observations. Some selected the bassy
speech, and termed it "richer", whereas others
preferred the "crisp" speech and termed it
"clearer." This dichotomy was very strong for male
speech, as mentioned earlier. In general, buzzy speech
was discarded. Many reported little difference
between sentences in a pair. While some relied on
local cues such as how individual words were
perceived, almost all listeners judged naturalness as
a global attribute, based on the total auditory
impression. This underscores the difficulty in
defining naturalness as a measurable quantity. All
the listeners reported high intelligibility and were
able to distinguish between male, female and child
voices.

Table 8.1  Average ratings for listener group one (in percentage)

Speech \ Speaker    Male    Female    Child    Average
Sentence 1          61.1     69.4     68.1      65.3
Sentence 2          51.3     69.4     65.3      62.9
Sentence 3          47.2     41.7     34.7      40.3
Sentence 4          40.3     19.4     31.9      31.5

Sentence 1: Original speech.
Sentence 2: Synthesized speech using pitch synchronous covariance
            LP analysis and differentiated EGG as excitation.
Sentence 3: Synthesized speech using pitch synchronous covariance
            LP analysis and Fant's excitation.
Sentence 4: Fixed frame autocorrelation LP analysis and impulse
            excitation.

Table 8.2  Relative preference ratings for male speaker (in percentage)

A \ B         Sentence 2    Sentence 3    Sentence 4
Sentence 1       50.0          62.5          70.8
Sentence 2         -           50.0          54.2
Sentence 3         -             -           58.3

Table 8.3  Relative preference ratings for female speaker (in percentage)

A \ B         Sentence 2    Sentence 3    Sentence 4
Sentence 1       45.8          79.2          83.3
Sentence 2         -           70.8          91.7
Sentence 3         -             -           66.7

Table 8.4  Relative preference ratings for child speaker (in percentage)

A \ B         Sentence 2    Sentence 3    Sentence 4
Sentence 1       50.0          75.0          70.8
Sentence 2         -           70.8          83.3
Sentence 3         -             -           41.7

8.3.2 Listener Group Two

This  group  consisted  of  six  graduate  students  engaged 
in  speech  research  and  Dr.  D.  G.  Childers  who  heads  the 
Mind-Machine  Interaction  Research  Center,  at  the  University 
of  Florida.  The  listening  conditions  and  test  material  were 
the  same  as  for  the  first  listener  group.  Out  of  the  seven 
listeners  in  this  group,  only  five  were  selected  since  their 
consistency  rating  exceeded  75%.  Table  8.5  lists  the 
average  ratings  for  the  four  sentences.  The  relative 
preference  ratings  for  male,  female  and  child  speakers  are 
given  in  Tables  8.6  - 8.8. 

The  results  of  the  test  were  as  follows. 

1)  The  widest  disagreement  was  again  between  sentence  2 
and  sentence  3 for  the  male  speaker.  Sentence  3 was 
rated  higher  than  sentence  2.  For  female  and  child 
speakers  the  naturalness  ratings  for  sentences  1,  2, 
3,  and  4 were  in  that  order. 

2) This group reported more buzziness in the child's
voice than the first listener group. The high F0 of
the child's voice was perhaps interpreted as sounding
"crisp."

3) The ratings for female speech were biased towards
loudness. Sentences 1 and 2 were preferred to
sentence 4 100% of the time. But a discussion which
followed this task seems to have changed the response
pattern for the child's case.

4) The average rating is consistent with that for group
one.

Table 8.5  Average ratings for listener group two (in percentage)

Speech \ Speaker    Male    Female    Child    Average
Sentence 1          71.7     78.3     71.7      73.9
Sentence 2          36.7     70.0     61.7      56.1
Sentence 3          48.3     41.7     40.0      43.3
Sentence 4          43.3     10.0     26.7      26.7

Table 8.6  Relative preference ratings for male speaker (in percentage)

A \ B         Sentence 2    Sentence 3    Sentence 4
Sentence 1       85.0          65.0          65.0
Sentence 2         -           55.0          40.0
Sentence 3         -             -           50.0

Table 8.7  Relative preference ratings for female speaker (in percentage)

A \ B         Sentence 2    Sentence 3    Sentence 4
Sentence 1       70.0          65.0         100.0
Sentence 2         -           80.0         100.0
Sentence 3         -             -           70.0

Table 8.8  Relative preference ratings for child speaker (in percentage)

A \ B         Sentence 2    Sentence 3    Sentence 4
Sentence 1       45.0          80.0          90.0
Sentence 2         -           65.0          65.0
Sentence 3         -             -           25.0


8.3.3  Listener  Group 

Three 

Two  of  the  listeners  from  group  two 

listened  to 

the 

same  sentences  using 

headphones  in  a sound 

room,  for  a 

more 

critical  evaluation. 

The  consistency  was 

greater  than 

80% 

for  both  listeners. 

Table  8.9  has  their 

ratings  for 

the 

four  sentences. 

1)  The  "bassy"  sentence  was  not  preferred  when  the 

headphones  were  used.  The  imperfections  and 

background  noise  were  easier  to  detect  in  this 
session.  The  choice  of  headphones  or  speakers  is 
decided  by  the  intended  application  but  some 
standardization  is  needed  in  this  area. 

2)  The  standard  LP  synthesis,  viz.,  fixed  frame  analysis 
with  impulse  excitation  was  least  preferred,  whereas, 
pitch  synchronous  synthesis  fared  better.  Synthesis 
with  EGG  excitation  was  rated  high. 

3)  Buzziness  in  synthesis  was  perceived  more  strongly 
than  over  loudspeakers.  But  some  cues  such  as  breath 
noise,  glottal  attack  were  used  to  make  judgments  of 
naturalness . 

The  rating  was  still  in  the  same  order  as  before, 
i.e.,  sentences  1,  2,  3,  and  4,  rated  in  that  order, 

but  the  preferences  were  well  separated  in  this  case. 


4) 


169 


Table 8.9 Average ratings for listener group
three using headphones (in percentage)

Speech \ Speaker    Male    Female    Child    Average

Sentence 1          79.2     83.3     91.7      84.7
Sentence 2          87.5     83.3     62.5      77.8
Sentence 3          20.8     16.7     29.2      22.2
Sentence 4          12.5     16.7     16.7      15.3
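
The Average column of Table 8.9 can be checked directly from the three speaker columns. The short Python sketch below does that arithmetic; the dictionary layout and the function name are illustrative choices and are not part of the original software.

```python
# Ratings from Table 8.9 (percent), ordered as (male, female, child).
ratings = {
    "Sentence 1": (79.2, 83.3, 91.7),
    "Sentence 2": (87.5, 83.3, 62.5),
    "Sentence 3": (20.8, 16.7, 29.2),
    "Sentence 4": (12.5, 16.7, 16.7),
}


def row_averages(table):
    """Return the mean rating across speakers for each sentence, to one decimal."""
    return {name: round(sum(values) / len(values), 1)
            for name, values in table.items()}


averages = row_averages(ratings)
for name, value in averages.items():
    print(name, value)
```

The values computed this way agree with the Average column to one decimal place.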
8.3.4 Comparison of Two Pitch Synchronous Synthesis Cases

In  Chapter  7,  we  discussed  two  pitch  synchronous 
analysis-synthesis  conditions,  viz.,  a)  analysis  over  one 
whole  period  and  b)  analysis  over  glottal  closed  phase 
only.  In  this  section,  we  discuss  results  of  subjective 
ratings  of  these  two  synthesis  types.  In  each  case,  two 
excitation  signals  were  used,  viz.,  a)  differentiated  EGG 
and  b)  differentiated  Fant's  excitation. 

The  listening  test  set  up  was  the  same  as  the  one 
discussed  earlier.  Only  the  male  speaker  was  used  for  this 
session,  since  closed  phase  had  been  confirmed  for  this 
subject  using  high  speed  laryngeal  films.  Table  8.10 
summarizes  the  evaluation  results. 

1)  When  the  analysis  was  performed  over  closed  phase,  but 
the  synthesis  excitations  differed,  listeners 
preferred  EGG  excitation  to  Fant's  excitation. 

2)  When  the  synthesis  excitation  was  the  same,  analysis 
over  one  pitch  period  was  strongly  preferred  to 
analysis  over  closed  phase  only. 

In  summary,  closed  phase  analysis  did  not  produce 
acceptable  synthesis,  compared  with  analysis  over  the  whole 
pitch  period.  Note  that  in  these  two  pitch  synchronous 
methods,  the  analysis  frame  locations  are  different,  viz., 
a)  over  closed  phase  only  and  b)  over  one  whole  pitch 
period. But during synthesis, the predictor coefficients
obtained in each case are used to generate one pitch period
of speech, and thus the synthesizer is updated every pitch
period in each case.

Table 8.10 Number of times sentences A and B
were chosen during their joint
occurrences (in percentage)

Pair        Group One       Group Two       Group Three
A   B       A      B        A      B        A      B

1   2      15.4   84.6      3.5   96.5     12.5   87.5
1   3      32.7   67.3     14.3   85.7      0.0  100.0
2   4      30.8   69.2     25.0   75.0      0.0  100.0

Sentence 1 - Analysis over closed phase only,
             synthesis using Fant's excitation.
Sentence 2 - Analysis over closed phase only,
             synthesis using EGG excitation.
Sentence 3 - Analysis over one pitch period,
             synthesis using Fant's excitation.
Sentence 4 - Analysis over one pitch period,
             synthesis using EGG excitation.
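
The per-period update of the synthesizer described in Section 8.3.4 can be illustrated with a minimal Python sketch. It is not the original FORTRAN program: the function name, the frame layout, and the sign convention s[n] = a1*s[n-1] + ... + ap*s[n-p] + e[n] are assumptions made for this example. The one point it tries to show is that a new set of predictor coefficients takes effect at each pitch period boundary while the filter memory carries over.

```python
import numpy as np


def synthesize_pitch_synchronous(frames):
    """Run an all-pole LP synthesizer whose coefficients change every pitch period.

    frames: sequence of (a, excitation) pairs, one per pitch period, where
    a = [a1, ..., ap] and the assumed model is
    s[n] = a1*s[n-1] + ... + ap*s[n-p] + e[n].
    """
    frames = list(frames)
    if not frames:
        return np.zeros(0)
    order = max(len(a) for a, _ in frames)
    memory = np.zeros(order)  # memory[k] holds s[n-1-k]
    output = []
    for a, excitation in frames:
        a = np.asarray(a, dtype=float)  # coefficients for this pitch period
        for e in excitation:
            s = float(e) + float(np.dot(a, memory[:a.size]))
            if order > 0:
                # Shift the past outputs and store the newest sample first.
                memory = np.roll(memory, 1)
                memory[0] = s
            output.append(s)
    return np.array(output)


# Two hypothetical pitch periods with different one-pole filters.
frames = [([0.5], [1.0, 0.0, 0.0]), ([-0.5], [0.0, 0.0])]
speech = synthesize_pitch_synchronous(frames)
```

Because the memory is not reset between periods, the second period in the example starts from the tail of the first, which is the behavior expected when each analysis frame generates exactly one pitch period of output.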


8.4 Conclusions

Listening tests have established the following results.

1) Pitch synchronous analysis-synthesis over one whole
pitch period was preferred to standard fixed frame
analysis.

2) Non-impulse excitations were consistently preferred to
standard impulse excitation. Differentiated EGG
excitation was chosen over Fant's excitation, except
for synthesis of male speech, where the two were
preferred equally.

3) Pitch synchronous analysis over one whole pitch period
was preferred to pitch synchronous analysis over
glottal closed phase only.


CHAPTER  9 

SUMMARY OF RESULTS AND CONCLUSIONS

9.1 Summary

In  this  study,  we  examined  a number  of  issues  involved 
in  the  production  and  evaluation  of  synthesized  speech.  We 
employed  the  linear  prediction  model  for  analysis  and 
synthesis  because  of  its  flexibility,  ease  of  computation 
and  wide  application  in  consumer  products. 

Different  approaches  to  speech  evaluation  were  reviewed 
in Chapter 2. Subjective evaluation continues to be the
most popular evaluation scheme. But this technique does not
explicitly  measure  the  contributions  of  specific  signal 
features  to  speech  naturalness.  Hence,  there  is 
considerable  interest  in  developing  objective  measurement 
procedures.  The  research  efforts  in  this  area  were 
reviewed.  We  argued  that  these  distance  measures  were 
ineffective  because  they  did  not  adequately  model  the 
original  speech  signal.  These  distance  measures  have  been 
used  to  evaluate  degradation  in  speech  due  to  transmission, 
coding,  etc.,  but  they  cannot  measure  how  inadequate  the 
underlying  model  itself  is.  For  example,  speech  synthesized 
using  standard  LP  analysis  methods  is  quite  distinguishable 
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from  the  original  speech.  The  LP  coefficients  thus  cannot 
objectively  measure  naturalness,  which  has  been  lost  in  the 
modeling  process.  So,  we  focused  our  study  on  improvements 
in  the  LP  model  from  the  point  of  view  of  synthesis.  The 
problem  areas  in  LP  synthesis  were  identified  and  the 
following  improvements  were  proposed. 

1) Reliable computation of F0 and voicing/unvoicing
decisions. 

2)  Improved  spectral  tracking  using  pitch  synchronous 
analysis-synthesis  schemes. 

3)  Non-impulse  excitations  incorporating  glottal  events 
such  as  closure,  opening,  etc. 

The  Electroglottograph  (EGG)  signal  was  used  as  a 
glottal  sensor  to  aid  in  the  three  improvements  mentioned 
above.  The  characteristics  of  the  EGG  signal  and  its 
relation  to  the  glottal  vibratory  cycle  were  discussed. 
Algorithms to compute F0 and V/UV decisions and to perform
pitch synchronous LP analysis and synthesis were developed.
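
As a rough illustration of how F0 and V/UV decisions might be read from an EGG signal, the Python sketch below marks one instant per glottal cycle and converts the spacing between marks into F0. It is not the algorithm developed in this study. The function names, the threshold, the F0 limits, and the assumed polarity (glottal closure appears as a rapid increase in the EGG signal) are all assumptions made for the example.

```python
import numpy as np


def closure_instants(egg, threshold_ratio=0.5):
    """Return indices of strong positive peaks in the differentiated EGG.

    Assumes closure shows up as a sharp rise in the EGG; flip the sign of
    the signal first if the recording uses the opposite polarity.
    """
    egg = np.asarray(egg, dtype=float)
    d = np.diff(egg)
    if d.size < 3 or d.max() <= 0.0:
        return np.zeros(0, dtype=int)
    threshold = threshold_ratio * d.max()
    mid = d[1:-1]
    is_peak = (mid > d[:-2]) & (mid >= d[2:]) & (mid > threshold)
    return np.flatnonzero(is_peak) + 1


def f0_from_egg(egg, fs, f0_min=50.0, f0_max=800.0):
    """Return (marks, f0, voiced) with one F0 value per interval between marks.

    An interval is called voiced when its F0 lies inside [f0_min, f0_max];
    F0 is set to 0.0 for intervals judged unvoiced.
    """
    marks = closure_instants(egg)
    periods = np.diff(marks) / float(fs)
    f0 = 1.0 / periods
    voiced = (f0 >= f0_min) & (f0 <= f0_max)
    return marks, np.where(voiced, f0, 0.0), voiced


# Synthetic test signal: one sharp rise every 100 samples at 10 kHz.
n = np.arange(1000)
egg = 1.0 - (n % 100) / 100.0
marks, f0, voiced = f0_from_egg(egg, fs=10000)
```

Because each mark is tied to a single glottal event, the period estimates keep the cycle-to-cycle variation instead of averaging it over a fixed analysis frame.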

Three  excitation  signals  were  generated  using  glottal 
information  derived  from  the  EGG  signal.  Sentences  for 
male,  female  and  child  subjects  were  analyzed  and 
resynthesized  using  the  LP  model.  The  EGG  signal,  time 
synchronized  with  the  speech  signal,  was  used  to  guide  the 
analysis-synthesis  procedure  and  to  generate  synthesis 
excitation  signals. 
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The  synthesized  sentences  were  evaluated  by  a total  of 
twenty  listeners  in  a formal  listening  test.  The  results  of 
subjective  evaluation  and  variability  in  naturalness  ratings 
among  listeners  and  among  different  listening  conditions 
were  discussed. 


9.2 Conclusions

1)  A major  contribution  of  this  study  was  the  improvement 
in  speech  synthesis  achieved  by  the  use  of  the  EGG 
signal  time  aligned  with  the  speech  signal.  The 
errors in computing F0 and V/UV decisions from the
speech  signal  alone  were  almost  completely  eliminated 
with  the  use  of  the  EGG  signal. 

2)  Synthesis  was  better  with  pitch  synchronous  schemes 
than  with  fixed  frame  schemes.  The  use  of  the  EGG 
signal  to  guide  the  pitch  synchronous  scheme  to 
perform  synthesis  is  a novel  scheme,  not  reported 
elsewhere,  although  analysis  schemes  guided  by  the  EGG 
signal have been studied (Gish, 1981; Krishnamurthy,
1983).

3)  Generating  the  synthesis  excitation  waveform  with  the 
help  of  the  EGG  signal  also  resulted  in  better 
synthesized  speech. 

4) The average responses of the listeners to one original
and three synthesized sentences were in the following
order.

i) Original speech.

ii) Pitch synchronous analysis over one period and
synthesis with differentiated EGG excitation.

iii) Pitch synchronous analysis over one period and
synthesis with Fant's excitation.

iv) Fixed frame analysis and synthesis with impulse
excitation.

Synthesis  with  closed  phase  analysis  resulted  in  less 
natural  speech  than  synthesis  with  analysis  over  one 
whole  pitch  period.  This  was  true  for  any  of  the 
excitations  used. 

5) Listeners reported buzziness for the child's voice
with high F0, particularly when headphones were used
for listening. Since the original sentence was also
sometimes marked as buzzy, it may be that the very
high F0 resulted in a buzzy sound. A more exhaustive
study of high F0 voices is needed to resolve this
problem.

6)  The  EGG  signal  can  be  easily  obtained  for  male,  female 
and  child  subjects.  But  it  is  essential  to  have  a 
steady  and  noise  free  EGG  signal  for  reliable  glottal 
sensing. 

7) From this study the following parameters were found to
contribute to improvements in synthesis.

a) Reliable F0 and V/UV decisions, obtained via the
EGG signal.

b) Non-impulse excitations that incorporate specific
glottal events.

c) Pitch synchronous analysis-synthesis to represent
rapid changes in the speech spectrum more
accurately.

d) Preserving the dynamic changes in F0 (jitter) and
changes in speech intensity from period to
period.

8) A lack of representation of the variability in natural
speech makes measurement of naturalness a very elusive
goal. The distance measures have been used, with some
success, in speech recognition and speaker
verification (Rabiner and Levinson, 1981). These two
tasks involve spectral matching for a small set of
templates. But in synthesis two identical spectral
models produce different naturalness ratings depending
on the excitation signal. Further, the same
synthesized sentence can be judged by a group of
listeners very differently. This makes a standardized
measure of naturalness unattainable at this point.
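
Item 7 d) above refers to period-to-period changes in F0 (jitter) and in intensity. One common way to quantify such changes, given here only as a hedged Python sketch and not as a measure used in this study, is the mean absolute difference between consecutive cycles divided by the mean value. The function name and the sample numbers are hypothetical.

```python
import numpy as np


def relative_perturbation(values):
    """Mean absolute cycle-to-cycle change divided by the mean value.

    Applied to pitch periods this gives a relative jitter figure; applied to
    per-period peak amplitudes it gives the analogous intensity figure.
    """
    v = np.asarray(values, dtype=float)
    if v.size < 2 or np.mean(v) == 0.0:
        return 0.0
    return float(np.mean(np.abs(np.diff(v))) / np.mean(v))


# Hypothetical pitch periods in milliseconds.
jitter = relative_perturbation([10.0, 11.0, 10.0, 11.0])
steady = relative_perturbation([10.0, 10.0, 10.0])
```

A synthesizer that repeats one period length within a frame would score zero on this figure, whereas pitch synchronous synthesis driven by measured periods retains whatever perturbation the speaker produced.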

The EGG signal cannot always be acquired with speech, for
example, in a vocoder application using telephone
lines. This does not, however, diminish the usefulness of
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the  EGG  signal  in  speech  research.  It  can  be  used  to  study 
different  modifications  of  the  LP  synthesizer.  Since  the 
excitation  parameters  can  be  obtained  reliably,  voicing 
errors  can  be  eliminated  and  pitch  synchronous  analysis  can 
be  efficiently  performed,  one  can  study  the  finer  details  of 
synthesis.  Storing  parametric  speech  data  for  automatic 
voice response can be performed more reliably, thus resulting
in  better  synthesis. 

The  LP  analysis-synthesis  system  used  in  this  study  can 
be  employed  to  investigate  pathological  voices.  The 
dysfunction  present  in  the  laryngeal  vibratory  pattern  can 
be  represented,  to  a large  extent,  by  the  source  excitation 
derived  from  the  EGG  signal.  Pitch  synchronous  analysis  can 
be  performed  to  achieve  better  spectral  tracking.  The 
relative  contributions  of  the  source  and  the  vocal  transfer 
function  to  pathology  can  be  studied.  Quantitative  measures 
for  studying  speech  pathology  have  important  clinical 
applications.  Measures  based  on  LP  spectrum,  roots  of  the 
LP  inverse  filter  polynomial,  etc.  have  been  studied.  But 
how  effective  are  these  measures?  Is  the  pathology  still 
preserved  by  the  modeling  process?  One  can  study  these 
issues  by  systematically  synthesizing  pathological  voices, 
using  the  analysis-synthesis  system  employed  in  our  study. 
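
The measures based on roots of the LP inverse filter polynomial mentioned above can be made concrete with a small Python sketch. It assumes the predictor convention s[n] = a1*s[n-1] + ... + ap*s[n-p] + e[n], so that the inverse filter is A(z) = 1 - a1*z^-1 - ... - ap*z^-p. The function name is hypothetical. Each complex root r is mapped to a resonance frequency angle(r)*fs/(2*pi) and a bandwidth -ln|r|*fs/pi.

```python
import numpy as np


def lp_poles_to_resonances(a, fs):
    """Return (frequency_hz, bandwidth_hz) pairs from LP predictor coefficients.

    a = [a1, ..., ap] under the assumed convention
    s[n] = a1*s[n-1] + ... + ap*s[n-p] + e[n].
    Only roots in the upper half plane are kept (one per conjugate pair).
    """
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    roots = np.roots(poly)
    roots = roots[np.imag(roots) > 1e-12]
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bandwidths = -np.log(np.abs(roots)) * fs / np.pi
    order = np.argsort(freqs)
    return [(float(freqs[i]), float(bandwidths[i])) for i in order]


# Hypothetical two-pole resonator: 500 Hz centre, 100 Hz bandwidth, 10 kHz rate.
fs = 10000.0
r = np.exp(-np.pi * 100.0 / fs)
theta = 2.0 * np.pi * 500.0 / fs
resonances = lp_poles_to_resonances([2.0 * r * np.cos(theta), -r * r], fs)
```

Comparing such pole estimates before and after resynthesis is one way to ask whether a feature of the original voice survives the modeling process.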


APPENDIX  A 

PROGRAMS  FOR  LP  ANALYSIS  AND  SYNTHESIS 


The  LP  analysis  and  synthesis  software  was  developed  in 
FORTRAN  on  a Data  General  Corporation  NOVA-4  computer.  The 
names  of  frequently  used  programs  for  data  collection,  pitch 
computation,  LP  analysis  and  synthesis,  etc.  are  given 
below. 

Speech and EGG data collection

N4ADC      - Two channel A to D routine.
JDACEDIT   - Selects segments of digitized data.
TPCOR      - Tape recorder correction.
JTREND     - Removes trends in EGG data.

Pitch computation

JPITCONT   - Computes pitch contour using the autocorrelation
             method and the SIFT algorithm.
JEGGPITCH1 - Computes pitch contour from the aligned EGG
             data.

LP analysis and synthesis

JLAP       - Fixed frame LP analysis program.
JLAP2      - Pitch synchronous LP analysis program.
JLSP       - Fixed frame LP synthesis program.
JLSP1      - Pitch synchronous LP synthesis program.

Excitation modeling

GLOTEX2    - Generates different excitation signals for the
             synthesizer, from the pitch contour.

Distance measures

SPDM       - Computes various distance measures between two
             sets of speech data.
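
The FORTRAN sources are not reproduced in this appendix. As a hedged illustration of the autocorrelation approach to pitch computation named in the entry for JPITCONT, the Python sketch below picks the strongest autocorrelation peak within a plausible lag range. It is not a transcription of that program and does not implement the SIFT algorithm. The function name, the F0 limits, and the voicing threshold are assumptions.

```python
import numpy as np


def autocorr_pitch(frame, fs, f0_min=50.0, f0_max=500.0, voicing_threshold=0.3):
    """Estimate F0 in Hz for one frame; return 0.0 if the frame looks unvoiced."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    # One-sided autocorrelation: r[k] for lags k = 0 .. len(x) - 1.
    r = np.correlate(x, x, mode="full")[x.size - 1:]
    if r[0] <= 0.0:
        return 0.0
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), x.size - 1)
    if lag_min < 1 or lag_max <= lag_min:
        return 0.0
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    # Treat a weak normalized peak as unvoiced.
    if r[lag] / r[0] < voicing_threshold:
        return 0.0
    return fs / lag


# A 50 ms frame of a 100 Hz sine sampled at 8 kHz.
fs = 8000
t = np.arange(400) / fs
estimate = autocorr_pitch(np.sin(2.0 * np.pi * 100.0 * t), fs)
```

A frame-based estimator of this kind returns one value per analysis frame, in contrast with an EGG-based contour, which can supply one value per glottal cycle.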

APPENDIX  B 

SET  UP  FOR  LISTENER  EVALUATION 


Some  of  the  details  of  the  listener  evaluation  session 
were  given  in  Chapter  8.  The  evaluation  form  along  with  the 
instruction  sheet  are  included  in  this  Appendix.  The 
details  of  the  set  up  are  summarized  below. 

a) Speech material      - Original and synthesized
                          sentences for male, female and
                          child speakers, recorded and
                          played back on a Revox A-77 tape
                          recorder.

b) Listener group       - 1) Twelve senior level students
                          and one instructor from the
                          Speech Department.

                          2) Six graduate students and
                          one professor from the
                          electrical engineering
                          department engaged in speech
                          research.

c) Listening conditions - 1) Quiet room and two infinity
                          loud-speakers.

                          2) Sound treated room and KOSS
                          PRO/4X headphones.

d) Test procedure       - Forced choice type. Select the
                          more natural sounding sentence
                          in each pair of sentences
                          presented (A, B).
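
The forced choice responses collected with this procedure can be reduced to preference percentages of the kind reported in Chapter 8. The Python sketch below shows one possible tally. The response format (which sentence was played as A, which as B, and the circled letter) and the function name are assumptions made for illustration, not a description of how the forms were actually scored.

```python
from collections import defaultdict


def preference_percentages(responses):
    """Tally forced-choice responses into percentages per sentence pair.

    responses: iterable of (sentence_a, sentence_b, choice), where choice is
    "A" or "B". Returns {(low, high): (pct_low, pct_high)} keyed by the pair
    of sentence numbers in ascending order, regardless of presentation order.
    """
    counts = defaultdict(lambda: [0, 0])
    for sentence_a, sentence_b, choice in responses:
        if choice not in ("A", "B"):
            raise ValueError("choice must be 'A' or 'B'")
        chosen = sentence_a if choice == "A" else sentence_b
        low, high = sorted((sentence_a, sentence_b))
        counts[(low, high)][0 if chosen == low else 1] += 1
    return {pair: (100.0 * c[0] / sum(c), 100.0 * c[1] / sum(c))
            for pair, c in counts.items()}


# Hypothetical responses for the pair (1, 2), presented in both orders.
responses = [(1, 2, "B"), (2, 1, "A"), (1, 2, "A"), (1, 2, "B")]
result = preference_percentages(responses)
```

Keying the tally on the sorted pair pools both presentation orders, so that a preference for a sentence is counted the same way whether it was heard first or second.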
This is a Listener Evaluation Session. It is designed
to evaluate the performance of a Speech Synthesis scheme.

You will be presented with a series of sentence pairs
(A,B). Your task is to judge which sentence in each pair
sounds more "natural" than the other and circle the
corresponding entry on the evaluation sheet.

The  presentation  sequence  is  as  follows. 

1)  A short  tone  for  cue. 

2)  Each  sentence  pair  repeated  twice  (A,B),  (A,B). 

3)  A 4-second  silence  for  marking  your  choice. 

The  above  sequence  will  be  repeated  for  each  pair. 

i)  Please  mark  your  choice  only  after  each  pair  has 
been  presented  twice. 

ii)  Make  your  judgements  despite  differences  in  the 
intensity  levels  between  the  two  sentences  in  a pair. 

iii)  Please  provide  a short  comment  at  the  end  of  the 
first  session  on  what  criteria  you  used  to  make  your 
decision  on  "naturalness." 
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LISTENER  EVALUATION  FORM 


Task I            Task II           Task III

 1)  A  B          1)  A  B          1)  A  B
 2)  A  B          2)  A  B          2)  A  B
 3)  A  B          3)  A  B          3)  A  B
 4)  A  B          4)  A  B          4)  A  B
 5)  A  B          5)  A  B          5)  A  B
 6)  A  B          6)  A  B          6)  A  B
 7)  A  B          7)  A  B          7)  A  B
 8)  A  B          8)  A  B          8)  A  B
 9)  A  B          9)  A  B          9)  A  B
10)  A  B         10)  A  B         10)  A  B
11)  A  B         11)  A  B         11)  A  B
12)  A  B         12)  A  B         12)  A  B
13)  A  B         13)  A  B         13)  A  B
14)  A  B         14)  A  B         14)  A  B
15)  A  B         15)  A  B         15)  A  B
16)  A  B         16)  A  B         16)  A  B
17)  A  B         17)  A  B         17)  A  B
18)  A  B         18)  A  B         18)  A  B
19)  A  B         19)  A  B         19)  A  B
20)  A  B         20)  A  B         20)  A  B
21)  A  B         21)  A  B         21)  A  B
22)  A  B         22)  A  B         22)  A  B
23)  A  B         23)  A  B         23)  A  B
24)  A  B         24)  A  B         24)  A  B

Name 


Comments 
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LISTENER  EVALUATION  FORM 


TASK  VI 


 1)  A  B
 2)  A  B
 3)  A  B
 4)  A  B
 5)  A  B
 6)  A  B
 7)  A  B
 8)  A  B
 9)  A  B
10)  A  B
11)  A  B
12)  A  B

Name 
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